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Abstract 

Image  compression  usually  considers  the  minimization  of  storage  space  as  its  main  ob¬ 
jective.  It  is  desirable,  however,  to  code  images  so  that  we  have  the  ability  to  process 
the  resulting  representation  directly.  In  this  thesis  we  explore  an  approach  to  document 
image  compression  that  is  efficient  in  both  space  (storage  requirement)  and  time  (pro¬ 
cessing  flexibility).  A  representation  is  presented  in  which  component-level  redundancy 
is  removed  by  forming  a  prototype  library  and  component  location  table.  This  represen¬ 
tation  forms  a  basis  for  compression  and  provides  direct  access  to  image  components.  To 
generate  the  prototype  library,  a  new  clustering  approach  is  developed  which  is  suitable 
for  document  image  components.  The  distance  metric  is  based  on  a  character  degrada¬ 
tion  model  so  that  degraded  versions  of  the  same  character  will  be  grouped  together.  To 
achieve  a  lossless  representation  when  required,  the  residuals  are  encoded  efficiently  using 
a  structural  distance  ordering.  OCR  is  then  used  as  a  measure  of  readability  to  evaluate 
the  rate  distortion  tradeoff  for  lossy  compression.  A  set  of  algorithms  is  presented  for 
typical  document  processing  applications  which  operate  effectively  on  the  compressed 
representation.  Applications  demonstrated  include  subdocument  retrieval,  skew  estima¬ 
tion 
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Chapter  1 


Introduction 


Technological  advances  in  processing,  storage,  and  visnalization  have  made  it  possi¬ 
ble  to  maintain  large  nnmbers  of  docnments  in  digital  image  form  and  make  them 
accessible  over  networks.  In  order  to  do  this  effectively,  three  primary  concerns  mnst 
be  addressed.  The  hrst  is  document  size.  An  ASCII  representation  of  a  docnment 
page  can  easily  be  stored  in  2-3  KB,  whereas  a  typical  scanned  image  of  a  page  may 
reqnire  between  500  KB  and  2  MB.  If  we  are  to  maintain  docnments  in  image  form, 
an  efficient  compressed  representation  is  essential  for  both  storage  and  transmission. 

The  second  concern  is  providing  efficient  access  to  the  compressed  image.  Tra¬ 
ditional  compression  techniqnes  nsed  for  docnment  images  have  been  snccessfnl  in 
redncing  storage  reqnirements  bnt  do  not  provide  efficient  access  to  the  compressed 
data,  ft  is  desirable  to  nse  a  compression  method  that  makes  nse  of  a  strnctnred 
representation  of  the  data,  so  that  it  not  only  allows  for  rapid  transmission  bnt  also 
allows  access  to  varions  docnment  components  and  facilitates  processing  of  docnments 
withont  the  need  for  expensive  decompression. 

The  third  concern  is  that  of  readability.  Many  lossy  compression  and  progressive 
transmission  techniqnes  nse  resolntion  rednction  or  textnre-preserving  methods  that 
can  render  a  docnment  image  nnreadable.  ft  is  desirable  that  a  docnment  be  readable 
even  at  the  highest  levels  of  lossy  compression  and  at  the  start  of  a  progressive 
transmission.  The  highly  lossy  representation  can  then  be  angmented  by  snbseqnent 
transmissions  for  better  rendition.  This  is  preferred  to  a  scenario  in  which  the  highest 
resolntion  is  the  only  readable  resolntion.  In  this  thesis,  we  will  address  these  concerns 
for  scanned  images  of  machine-printed  docnments. 

The  task  of  compression,  organization,  and  transmission  of  information  is  not  new 
to  the  image  and  video  domains.  A  nnmber  of  techniqnes  have  been  developed  to  ad¬ 
dress  the  problems  of  image  qnality  and  compressed-domain  access  bnt  the  solntions 
have  had  very  little  general  application  becanse  of  the  domain-specihc  reqnirements. 
Likewise,  docnment  images  have  specihc  characteristics  and  reqnirements  which  re¬ 
qnire  special  attention  when  considering  approaches  to  compression.  Two  general 
qnestions  which  arise  immediately  are: 

1.  What  are  the  characteristics  of  a  docnment  image  which  can  be  exploited  to 
beneht  compression  and  compressed-domain  processing? 
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2.  What  has  been  done  before  that  can  be  extended  to  the  docnment  image  do¬ 
main? 

We  will  address  these  qnestions  in  the  next  snbsections  by  reviewing  standard  com¬ 
pression  techniqnes,  examining  docnment  image  characteristics,  and  reviewing  work 
done  on  docnment  image  compression. 

1.1  Image  compression 

The  initial  breakthronghs  in  the  compression  of  one-dimensional  signals  [109]  were 
easily  extended  to  the  image  domain  by  concatenating  image  rows  or  columns  into 
a  single  stream.  Techniqnes  snch  as  Shannon-Fano  coding  [95]  and  Hnffman  cod¬ 
ing  [26,  29,  55,  106,  118,  119]  nse  rednndancy-rednction  mechanisms  which  resnlt  in 
shorter  codes  for  more  freqnently  appearing  samples.  It  is  necessary  to  scan  the  data 
samples  in  order  to  determine  their  probabilities  of  occnrrence  and  create  an  appro¬ 
priate  code.  Adaptive  variations  of  these  techniqnes  initially  assume  eqnal  probability 
for  all  samples  and  calcnlate  snbseqnent  probability  measnres  based  on  a  hxed  win¬ 
dow  length  prior  to  the  sample  of  interest.  This  allows  local  changes  in  probability 
measnrements  and  achieves  higher  global  compression.  Rnn-length  coding  [38]  is  an¬ 
other  rednndancy-rednction  coding  method  where  in  a  scan-line  each  rnn  of  symbols 
is  coded  as  a  pair  that  specihes  the  symbol  and  the  length  of  the  rnn. 

While  most  rednndancy-rednction  methods  are  lossless,  other  arbitrarily  lossy 
coding  methods  have  achieved  higher  levels  of  compression.  Transform  coding  [44], 
snbband  coding  [61,  77,  105,  114],  vector  qnantization  [61,  62,  66],  and  predictive 
coding  [1,  3,  22,  31,  46,  57,  59,  76,  83,  97]  are  among  the  ones  that  have  achieved  high 
levels  of  lossy  compression. 

The  major  transform  coding  techniqnes  inclnde  cosine/sine  [2,  44],  Fonrier  [5,  44], 
Hadamard  [27,  44],  Haar  [27,  44],  slant  [44],  and  principal-component  (Karhnnen- 
Loeve)  transforms  [6,  44].  All  transform  coding  based  compression  algorithms  are 
pixel-based  approaches  which  decompose  a  signal  nnto  an  orthonormal  basis  to  achieve 
energy  compaction.  Lossy  compression  is  then  achieved  by  coding  the  high  energy 
components  and  leaving  ont  the  low  energy  ones.  While  some  transformations  snch 
as  the  Hadamard  and  Haar  transforms  can  be  performed  relatively  qnickly,  other 
transforms  are  compntationally  intensive  and  either  reqnire  dedicated  hardware  or 
restrictions  snch  as  limiting  the  size  of  the  transform  to  a  power  of  two. 

In  snbband  coding,  as  the  name  implies,  different  freqnency  bands  are  treated 
differently  to  achieve  higher  hdelity  or  higher  compression.  As  with  transform  cod¬ 
ing,  the  signal  is  transformed  to  an  alternate  domain,  bnt  instead  of  leaving  ont  a 
component,  it  is  sampled  at  a  lower  resolntion  while  more  important  components 
are  qnantized  at  higher  resolntion.  In  effect,  at  each  snbband  the  application  of  a 
different  qnantization  strategy  is  nsed  to  preserve  hdelity  and  achieve  compression. 

Vector  qnantization  is  a  hybrid  of  statistical  analysis  and  pattern  recognition 
with  a  large  nnmber  of  weU-established  techniqnes.  This  method  hrst  divides  an 
image  into  blocks  which  are  then  serialized  into  large-dimensional  vectors.  Fach 
vector  is  treated  as  a  sample  in  a  high-dimensional  space  and  this  space  is  partitioned 
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into  subspaces,  called  classes.  For  each  class  of  vectors,  a  prototype  is  created  that 
closely  resembles  the  vectors  which  belong  to  the  class.  Vector  quantization  shifts 
the  partition  boundaries  by  calculating  a  distance  measure.  Errors  between  in-class 
vectors  and  the  corresponding  prototypes  over  all  classes  are  minimized  to  provide 
optimum  prototype  placement.  High  levels  of  compression  are  achieved  by  coding 
every  vector  that  belongs  to  a  given  class  by  a  reference  to  the  prototype  vector. 
The  level  of  lossiness  is  then  dependent  on  the  distance  measure  and  the  number  of 
prototype  vectors. 

Predictive  coding,  such  as  the  differential  pulse  code  modulation  (DPCM)  patented 
by  Cutler  [22],  predicts  the  next  sample  based  on  the  previously  coded  samples  and 
codes  the  error  between  the  prediction  and  the  actual  sample.  The  main  objective  of 
DPCM  is  to  narrow  down  the  range  of  coded  samples.  However,  images  are  not  sta¬ 
tionary  in  texture,  and  some  predictors  may  work  in  certain  areas,  but  not  in  others. 
Hence,  the  concept  of  adaptive  predictors  [3,  3f ,  83]  was  introduced  so  that  the  coder 
and  decoder  could  choose  from  a  set  of  predictors  at  the  cost  of  transmitting  the  pa¬ 
rameters  needed  for  the  prediction.  This  method  of  coding  needs  to  limit  prediction 
error  for  efficient  coding  and  achieves  lossless  compression  in  coding  the  entire  error; 
lossy  coding  is  achieved  by  limiting  the  range  of  the  error  value. 

More  recently,  second-generation  compression  algorithms  based  on  human  visual 
behavior  [68]  have  been  proposed  which  have  the  potential  for  much  higher  com¬ 
pression  ratios.  Also,  developments  in  user-specihed  wavelet  transforms  [f8,  68]  can 
be  used  to  address  compression  requirements  for  specihc  image  types.  Fractal  cod¬ 
ing  [117],  which  is  also  called  chaos  coding,  is  an  example  of  progress  in  the  area 
of  two-dimensional  redundancy  reduction.  All  of  these  compression  techniques  are 
pixel-based  and  they  perform  best  for  textured  images. 

1.2  Document  characteristics 

Document  images  are  scans  of  documents  which  are  in  most  cases  pseudo-binary 
and  rich  in  textual  content.  Informally,  we  can  dehne  document  images  as  images 
that  contain  components  that  resemble  the  symbols  of  a  language.  Many  documents 
look  like  those  shown  in  Figure  1.1  and  are  represented  adequately  by  binary  images 
produced  by  scanning  at  a  resolution  of  300  dots  per  inch  (dpi).  They  tend  to  be 
highly  structured  in  terms  of  layout,  and  have  signihcant  redundancy  in  the  symbols 
which  occur  in  them. 

Since  a  document  can  be  used  in  a  large  number  of  ways,  an  important  consid¬ 
eration  is  how  the  image  is  affected  by  compression.  For  example,  if  we  intend  that 
the  document  ultimately  be  read  by  humans,  it  is  necessary  for  compression  schemes 
to  preserve  the  shapes  of  the  components  so  that  they  are  recognizable  by  the  reader 
after  retrieval.  If  the  document  must  be  reproduced  in  near-original  form,  techniques 
which  reduce  resolution  may  not  be  acceptable.  Thus  the  required  resolution  is  de¬ 
pendent  on  the  task;  some  resolutions  may  leave  the  document  readable,  but  may 
destroy  hue  detail. 

As  is  well  known,  the  performance  of  conventional  compression  algorithms  (such 
as  those  based  on  transform  coding  [44],  vector  quantization  [30],  fractal  compression 
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Figure  1.1:  Examples  of  binary  document  images  of  the  types  usually  found  in  a 
document  image  database.  These  are  images  from  the  University  of  Washington 
document  database  [32],  cropped  (automatically)  to  the  main  body  of  the  text. 
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[12],  pattern  matching  and  snbstitntion  [10,  64],  and  other  approaches)  depends  on  the 
types  of  images  being  compressed,  and  on  their  textnre  and  content  characteristics. 
Algorithms  often  do  well  on  some  classes  of  images  and  not  so  well  on  others.  Since 
document  images  differ  signihcantly  from  scene  images,  it  is  reasonable  to  assnme 
they  will  beneht  from  a  specialized  compression  scheme. 

1.3  Previous  work  on  document  image  compression 

Document  image  compression  was  hrst  recognized  as  a  special  case  of  compression 
in  the  f970s  when  increased  use  of  facsimile  machines  promised  potential  growth  in 
document  scanning  and  transmission.  The  international  communication  standards 
body  then  set  out  to  create  a  standard  for  use  in  facsimile  transmission.  In  1980 
the  hrst  comprehensive  standard  was  released  as  the  CCITT  Group  3  transmission 
standard  [41],  followed  by  CCITT  Group  4  [15,  36,  85].  Both  standards  were  originally 
designed  for  digital  transmission  rather  than  for  analog  phone  lines,  but  nevertheless 
have  become  standards  for  facsimile  transmission.  Both  schemes  use  a  run-length- 
based  approach  where  the  white  and  black  runs  are  pre-coded  in  the  compression 
algorithm.  An  extension  to  dual  scan-line  compression  exploits  coherence  between 
successive  scan  lines  by  inserting  references  to  the  previous  scan-line’s  black  runs.  In 
Group  4,  higher  compression  is  achieved,  largely  due  to  the  omission  of  various  error- 
correcting  codes  which  were  intentionally  included  in  the  Group  3  standard.  These 
methods  are  clearly  sufficient  for  transmission  of  facsimile  since  a  facsimile  scanner 
sends  and  receives  in  scan-line  order. 

•Joint  Binary  Image  experts  Group  (JBIG)  is  a  recent  CCITT  standard.  JBIG 
performs  compression  on  binary  images  and  provides  a  lossy  representation  suited  for 
compression  and  transmission.  Context  modeling  forms  the  basis  of  its  coding  [8f ,  82], 
At  the  lowest  levels,  JBIG  uses  template  models  and  adaptive  arithmetic  coding  to 
encode  predictions.  “Layers”  of  the  image  are  represented  at  reduced  resolutions  to 
provide  progressive  transmission.  Exception  pixels  are  used  to  preserve  overall  image 
layout  and  structure;  this  works  well  for  textured  dithered  images,  but  loses  hdelity 
when  small-scale  structure  is  essential,  as  in  the  case  of  text  documents.  JBIG, 
despite  its  use  of  context-based  coding,  is  unable  to  organize  information  to  be  used 
for  compressed-domain  processing.  This  is  mostly  due  to  its  use  of  arithmetic  coding 
directly  on  the  pixels.  The  output  of  the  arithmetic  coder  is  highly  complex  and 
cannot  be  easily  deciphered  for  compressed-domain  processing.  The  only  favorable 
characteristic  of  JBIG  is  its  use  of  intermediate  representations  for  lossy  compression 
and  progressive  transmission. 

Recent  work  by  Howard  [37]  performs  context-based  arithmetic  coding  on  the  bit 
map  image  based  on  predictive  coding  of  “clairvoyant”  [37,  112]  pixels  and  previously 
coded  pixels.  Clairvoyant  pixels  are  pixels  belonging  to  prototype  images;  in  Howard’s 
coding  scheme  they  contribute  to  knowledge  of  the  current  pixel’s  distribution  which 
can  be  used  for  better  coding.  This  is  still  a  pixel-based  approach  and  does  not 
provide  for  compressed-domain  processing.  While  Howard  proposes  a  methodology  for 
scalable  lossy  compression,  his  approach  lacks  a  hierarchical  representation  suitable 
for  progressive  transmission. 
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An  important  approach  to  document  compression,  snggested  by  Ascher  and  Nagy 
[9]  and  later  formalized  by  Witten  et  al.  [43,  110,  111,  112],  is  based  on  the  repetitive 
natnre  of  text  components.  In  a  departnre  from  pixel-level  to  symbol  level  coding, 
marks  fonnd  within  a  docnment  are  coded.  The  method  is  very  similar  to  pattern¬ 
matching  and  snbstitntion  algorithms  [10,  47,  64,  69,  70,  85,  113].  It  is  based  on  the 
measnrement  of  similarity  between  symbols  within  an  image.  In  this  way  rednndancies 
in  symbol  shapes  are  identihed,  resnlting  in  two-dimensional  rednndancy  redaction 
versns  the  traditional  one-dimensional  rednction. 

Ascher  and  Nagy’s  approach  considered  a  stream  of  bitmap  images  which  repre¬ 
sented  individnal  characters.  They  proposed  an  algorithm  which  creates  a  dynamic 
library;  an  nncataloged  character  gets  added  to  the  library  and  a  character  that  has 
already  been  seen  gets  replaced  by  an  index  to  a  library  item.  The  library  is  a  col¬ 
lection  of  bitmap  images.  By  keeping  the  library’s  size  limited,  an  adaptive  symbol 
compression  method  is  achieved;  changes  in  the  symbol  stream  are  compensated  by 
creating  new  library  prototypes  and  throwing  ont  old  and  nnnsed  ones. 

Witten  et  al.  expanded  on  this  approach  to  address  extraction  of  image  compo¬ 
nents,  indexing  their  locations  within  the  page  and  compressing  those  indices,  and 
coding  the  residnals  left  after  inexact  replacement  of  components  by  library  proto¬ 
types.  This  work  made  contribntions  to  prototype  generation,  symbol  index  com¬ 
pression,  and  residnal  coding.  Connected  component  analysis  is  performed  on  each 
bitmap  image.  The  components  are  then  clnstered  by  a  symbol  matching  method 
snch  as  Pratt’s  combined  symbol  matching  [85],  Holt  and  Xydeas’  “weighted  and- 
not”  method  [36],  Holt’s  combined  size- independent  strategy  [35],  and  Johnsen’s 
generic  pattern  matching  and  snbstitntion  [47].  Another  method  of  matching,  called 
compression-based  template  matching  [43]  was  snggested  by  Inglis  and  Witten  and 
was  later  improved  npon  by  Zhang  and  Danskin  [116];  it  was  based  on  mntnal  infor¬ 
mation  between  the  component  and  the  prototype.  The  locations  of  the  components 
are  ordered  in  natnral  reading  order,  indexed  with  respect  to  each  other  (a  compo¬ 
nent  is  referenced  by  its  npper  left  corner  with  respect  to  the  bottom  right  corner 
of  the  previons  component),  and  compressed.  The  residnal  map  is  kept  for  lossless 
compression  since  clnstering  generalizes  the  inpnt  patterns  and  there  exist  differences 
between  the  components  and  prototypes  which  may  affect  readability.  The  residnal 
map  is  not  compressed,  bnt  the  original  image  is  compressed;  the  image  composed  of 
prototypes  is  nsed  as  context  in  the  context-based  arithmetic  coding. 

This  method  satishes  onr  concerns  to  a  large  extent  bnt  falls  short  in  some  areas. 
Most  of  the  shortfalls  lie  in  the  processing  domain.  The  hrst  drawback  is  that  direct 
access  to  the  components  is  lost,  or  at  least  difficnlt,  when  differential  indexing  is 
nsed.  The  second  drawback  is  the  lack  of  hierarchical  coding  to  allow  for  a  scalable 
intermediate  representation.  Only  two  levels  of  representation  are  available  (with  and 
withont  residnals),  which  does  not  make  this  method  snitable  for  lossy  compression 
and  progressive  transmission.  The  third  drawback  is  the  failnre  to  address  readability. 
They  did  not  ntilize  a  noise  model  to  estimate  image  qnality  and  derive  a  scalable 
representation.  They  were  concerned  only  with  compression,  which  they  achieved 
rather  well. 
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1.4  Compressed-domain  processing  of  document  images 

Work  on  skew  estimation  like  that  of  Baird  [11]  has  been  shown  to  be  efficient  and 
accnrate.  However,  detection  of  alignment  featnre  locations  is  compntationally  de¬ 
manding  since  the  volnme  of  data  is  large,  ft  wonld  be  extremely  desirable  if  the 
detection  of  alignment  featnres  conld  be  done  in  the  compressed  domain.  In  Spitz’s 
work  [100],  the  pass  codes  in  CCITT  Gronp  4  compressed  images  were  nsed  for  Baird’s 
alignment  featnres  (called  hdncial  points)  to  estimate  a  skew  angle.  Pass  codes  are 
part  the  Gronp  4  specihcation  which  mark  the  locations  where  the  previons  scan  line 
has  a  rnn  of  black  pixels  and  the  cnrrent  scan  line  does  not.  The  pass  codes  are  heavily 
popnlated  at  the  bottoms  of  characters  and  provide  a  basis  for  skew  estimation.  The 
occasional  pass  codes  generated  by  descenders,  dots,  etc.  do  not  severely  degrade  the 
performance  of  the  skew  estimation.  However,  skew  correction  is  a  non-trivial  task 
with  a  rnn-length  encoding  based  algorithm. 

Scanning  and  recognizing  barcodes  nsing  hand-held  scanners  has  been  qnite  snc- 
cessfnl,  bnt  the  problem  of  recognizing  barcodes  in  scanned  images  has  received  little 
attention.  In  the  work  done  by  Maa  [65],  detection  of  barcodes  is  performed  on 
CCITT  Gronp  4  representations.  The  basis  for  Maa’s  work  is  that  since  bars  and 
spaces  are  npright  and  perfectly  aligned,  Gronp  4  compression  will  nse  vertical  coding 
so  as  to  rednce  vertical  rednndancy.  Thns  in  all  scan  lines  except  the  hrst,  every  rnn 
of  bars  and  spaces  will  be  represented  by  the  same  code.  This  will  resnlt  in  rnns  of  this 
code  at  image  locations  where  there  exist  barcodes  with  no  interfering  patterns.  With 
added  noise,  only  one  or  two  other  codes  were  nsnally  fonnd  in  these  rnns  and  they 
were  still  nsable  for  detection  of  barcodes.  An  86%  performance  rate  was  reported. 

For  pnrposes  of  docnment  image  retrieval,  it  is  desirable  to  index  on  distinctive 
image  characteristics.  One  class  of  snch  characteristics  which  appear  in  docnment  im¬ 
ages  are  logos.  Logos  are  dehned  by  Spitz  [fOf]  as  spatial  arrangements  of  strnctnres. 
He  nsed  the  hdncial  points  that  he  had  applied  to  skew  estimation,  and  considered 
them  to  be  a  series  of  points  locally  aligned  to  specihc  signatnres.  A  signatnre  rep¬ 
resentative  of  a  logo  prototype  can  be  determined  and  compared  with  signatnres 
extracted  from  the  image.  Localizing  the  logo  was  not  considered,  bnt  this  is  an  easy 
problem  since  a  prototype  logo  is  of  a  known  size  and  the  signatnre  determined  from 
the  image  shonld  only  depend  on  that  size.  Experiments  on  a  small  set  of  images 
were  reported  to  resnlt  in  very  high  recognition  rates. 

Recently,  docnment  image  matching  has  been  considered  by  a  number  of  re¬ 
searchers  [89,  96,  ff5].  Recent  work  by  Hnll  [40]  addresses  this  problem  by  nse 
of  compressed-domain  processing.  As  in  Spitz’s  work  on  skew  estimation  and  logo 
detection,  Hnll  considers  the  locations  of  pass  codes  as  featnre  points  and  compares 
two  sets  of  featnre  points  for  a  possible  match.  These  featnre  points  are  contained  in 
a  bonnding  box,  and  in  Hnll’s  work  he  determines  the  best  choice  of  a  bonnding  box 
to  yield  a  high  match  score.  Given  two  sets  of  featnre  points,  Hnll  nsed  the  Hansdorff 
distance  [42],  which  measnres  the  amonnt  of  pertnrbation  needed  to  overlap  pairs  of 
points  in  a  pair  of  images,  and  reported  a  high  detection  rate.  Efficient  measnrement 
of  the  Hansdorff  distance  has  been  stndied  by  Rncklidge  [88]. 

ft  wonld  be  desirable  to  develop  a  docnment  compression  method  that  can  ad- 


7 


dress  a  larger  amount  of  compressed-domain  processing.  In  doing  this,  we  can  take 
advantage  of  the  fact  that  the  features  used  in  such  tasks  as  skew  estimation,  correc¬ 
tion,  and  matching  are  dependent  on  the  document  image  symbols,  so  that  a  method 
allowing  access  to  these  symbols  should  facilitate  document  image  processing. 

1.5  Symbolic  compression 

Symbolic  compression  is  a  method  of  coding  similarly-shaped  marks  in  an  image 
with  reference  to  a  set  of  templates.  In  doing  so,  individual  marks  must  hrst  be 
segmented  out.  For  binary  document  images,  segmentation  is  typically  a  matter  of 
determining  connected  components  of  black  pixels  in  the  image.  Perfect  segmenta¬ 
tion  is  not  necessary  since  the  residual  bits  which  remain  after  replacing  a  component 
with  a  template  are  also  coded.  The  segmented  components  are  clustered  to  cap¬ 
ture  their  redundancies.  Various  methods  of  clustering,  such  as  simple  exclusive-or 
and  compression-based  template  matching  [35,  112],  can  be  used.  Once  all  symbols 
within  the  image  have  been  considered  for  clustering,  the  cluster  centers  can  serve  as 
representative  templates  or  prototypes  for  the  clusters.  Some  clusters  may  need  to  be 
removed  because  of  inadequate  membership,  and  some  clusters  may  exhibit  large  in- 
class  variances  which  result  in  abnormally  large  residuals.  The  prototypes  are  stored 
in  a  library.  The  locations  of  the  symbols  must  be  recorded  so  that  a  highly  compact 
representation,  based  on  prototypes  only,  is  immediately  available.  The  difference  be¬ 
tween  this  symbolic  representation  and  the  original  image  is  a  residual  image  which 
has  far  fewer  foreground  pixels  than  before;  this  implies  a  high  compression  ratio. 
Context-predictive  coding  of  the  residual  image  further  improves  the  compression 
ratio. 

Residual  coding  is  the  portion  of  the  compression  algorithm  which  encodes  the 
residuals.  To  preserve  symbol  access  for  compressed-domain  analysis  [50,  53],  it  is 
necessary  to  code  the  error  on  a  per-symbol  basis  rather  than  coding  it  for  the  entire 
image.  Using  a  structural  model,  it  is  possible  to  order  the  error  pixels  based  on 
their  structural  importance.  The  structural  order,  derived  in  part  from  the  work  of 
Kanungo  et  ah  [48],  is  based  on  the  observation  that  the  degradation  of  symbols  in  a 
document  image  from  the  effects  of  copying  and  scanning  occurs  mostly  around  the 
edges  of  symbols.  Nagy  et  ah  also  observed  the  near-edge  degradation  of  document 
images  by  considering  the  effects  of  point  spread  functions  on  the  quality  of  a  scanned 
image  [72].  This  type  of  degradation  does  not  adversely  affect  to  the  readability  of  the 
document  and  does  not  provide  structural  information.  Conversely,  it  is  observed  that 
pixels  farther  from  edges  have  dramatic  effects  on  the  readability  of  symbols  and  are 
structurally  more  important.  By  ordering  the  error  pixels  in  distance-to-edge  order, 
we  can  exploit  their  statistical  characteristics  by  putting  the  most  probable  error 
pixels  last,  and  we  can  provide  structural  coding,  by  putting  pixels  which  have  the 
potential  to  greatly  impact  the  correct  recognition  of  characters  hrst.  This  paves  the 
way  for  compression  by  packing  the  energy  in  the  error  stream,  for  lossy  compression 
by  terminating  the  error  stream  at  an  index  calculated  by  the  signal-to-noise  ratio, 
and  for  progressive  transmission  by  interleaving  the  error  signals  for  a  set  of  image 
marks  based  on  their  distance  order  [49]. 
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To  perform  successful  symbolic  compression,  we  must  perform  a  segmentation 
that  decomposes  the  image  into  components  that  repeat.  These  components  then 
need  to  be  clustered  so  as  to  generate  a  set  of  representative  prototypes.  The  quality 
of  performance  on  these  tasks,  along  with  the  quality  of  the  image,  can  drastically 
affect  the  overall  compression  performance.  In  our  approach  to  this  problem  we 
concentrate  on  bi-level  document  images  that  have  large  textual  image  content.  This 
is  not  a  strict  requirement,  but  allows  us  to  measure  the  performance  of  our  document 
image  compression  scheme  effectively.  Work  done  by  Sennhauser  and  Ohnesorge  [94] 
and  Sauvola  and  Pietikainen  [93]  proves  that  block-based  compression  gives  better 
performance  on  document  images  that  contain  images.  It  is  also  useful  to  consider 
algorithms  such  as  Etemad  et  ah  [25],  Anatoncopoulos  et  ah  [7],  Jain  et  ah  [45], 
Pavlidis  et  al.  [80],  and  others  [28,  79,  t08]  to  provide  image  segmentation  into 
regions  of  different  types  so  that  different  compression  schemes  may  operate  on  the 
regions.  We  assume  that  there  exist  enough  components  in  the  text  regions  that  can 
be  extracted  by  connected  component  analysis  and  that  can  be  recognized  correctly 
by  a  human  reader.  The  fact  that  the  components  are  repetitive  in  nature  makes  them 
a  rich  source  for  compression.  Also  the  fact  that  document  images  contain  a  large 
amount  of  textual  material  laid  out  in  a  two-dimensional  space  makes  them  ideal  for 
a  two-dimensional  redundancy  reduction  scheme.  Since  the  useful  information  in  the 
textual  portion  is  contained  in  the  components,  access  to  the  components  provides 
access  to  that  information. 

There  is  a  strong  case  for  the  use  of  symbolic  compression  in  document  image 
coding,  in  addition  to  space  savings.  With  the  use  of  only  about  one  percent  of 
the  information  in  the  original  image,  it  is  possible  to  code  the  dominant  symbol 
shapes  and  their  locations  in  the  image.  Since  most  of  the  information  in  a  document 
is  in  the  symbols,  it  is  possible  to  perform  compressed-domain  analysis  by  using  a 
symbolic  compressed  representation.  Using  only  this  representation,  though  it  is  not 
perfect,  it  is  possible  to  perform  a  large  number  of  document  analysis  tasks  [75]  such  as 
skew  estimation/correction  [It,  13,  63,  73,  84],  Optical  Character  Recognition  (OCR) 
[71],  layout  analysis  [74,  80],  and  so  on.  Although  some  work  has  been  done  on  the 
processing  of  compressed  document  images  [65,  100,  101]  the  outlook  for  applying 
such  techniques  to  pixel-based  methods  does  not  appear  promising. 

The  organization  of  this  dissertation  is  as  follows.  Chapter  2  describes  a  data 
representation  model  which  is  designed  based  on  requirements  related  to  compres¬ 
sion  and  processing.  This  representation  not  only  facilitates  the  derivation  of  key 
components  of  the  compression  algorithm  but  also  streamlines  the  design  and  imple¬ 
mentation  of  a  number  of  compressed-domain  tasks.  This  chapter  provides  details 
about  the  general  approach,  the  compressed  representation  in  terms  of  streams,  com¬ 
pression/decompression  of  the  streams,  and  modules  to  integrate  various  tasks  and 
data  structures. 

Chapter  3  presents  an  argument  for  the  use  of  a  special  class  of  clustering  algo¬ 
rithms  in  the  compression  algorithm.  In-depth  discussion  of  clustering  algorithms  is 
given  along  with  a  new  method  that  facilitates  symbolic  compression.  Motivation 
and  examples  are  given  to  support  our  choices. 

Chapter  4  discusses  various  methods  of  performing  residual  coding.  The  residue  is 
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a  data  structure  used  in  the  compression  algorithm  to  allow  for  lossless  and  scalable 
lossy  compression.  In  this  chapter  a  number  of  methods  are  pursued  which  provide 
desirable  lossy  representations  along  with  higher  compression  factors.  A  method  of 
structural  coding  is  also  presented  which  provides  improvements  in  both  lossy  and 
lossless  representation.  This  coding  has  an  immediate  application  to  transmission 
problems.  Analytical  and  empirical  results  are  given. 

Chapter  5  analyzes  the  rate-distortion  trade-off  of  the  residual  codes  mentioned 
in  Chapter  4.  This  trade-off  is  applicable  to  progressive  transmission  and  lossy  com¬ 
pression  problems.  An  OCR-based  distortion  measure  is  used  to  evaluate  our  code 
relative  to  competing  methods. 

Chapter  6  describes  a  set  of  tasks  which  we  were  able  to  perform  successfully  in 
the  compressed  domain.  In  this  chapter  we  review  the  algorithms  that  were  used 
along  with  performance  results  and  benchmark  outputs  (when  applicable). 

Chapter  7  discusses  extensions  of  our  approach  to  grayscale  documents,  graphics, 
networked  databases,  and  hypertext  documents.  Some  preliminary  results  are  given, 
but  most  of  the  extensions  are  still  speculative. 

1.6  Summary  of  contributions 

The  following  is  a  list  of  contributions  made  in  this  dissertation  along  with  brief 
explanations. 

Compressed  data  representation:  The  compression  algorithm  creates  data 
structures  that  have  specihc  uses  and  vary  in  characteristics  and  statistics.  Our  rep¬ 
resentation  of  the  data  contributes  to  its  efficient  use  for  compression  and  processing. 

Pattern  clustering:  One  major  component  of  the  compression  algorithm  is 
clustering  of  the  set  of  patterns.  A  clustering  approach  is  pursued  which  matches 
the  patterns  and  classihes  them  based  on  their  match  scores.  We  introduce  a  new 
pattern  matching  method  that  attempts  to  preserve  structural  properties  by  weighting 
component  pixels  based  on  their  distances  to  the  closest  edge  of  the  prototype. 

Residual  coding:  After  substitution  of  cluster  prototypes  for  the  patterns,  we 
code  the  residuals.  Since  the  residuals  constitute  the  majority  of  the  compressed  rep¬ 
resentation,  we  develop  a  hierarchy  that  exhibits  desirable  compression,  lossy  com¬ 
pression,  and  progressive  transmission  qualities.  Three  coding  methods  are  derived 
from  a  novel  idea  based  on  structural  components  of  the  residuals.  A  unique  advan¬ 
tage  of  these  codings  is  the  reduction  of  image  pattern  variability  rather  than  the 
traditional  resolution  reduction  to  achieve  lossy  representations. 

Rate  distortion:  We  pursue  a  functional  approach  to  the  measurement  of  dis¬ 
tortion  as  it  applies  to  compression  and  transmission  issues.  We  derive  a  common 
platform  for  testing  and  evaluation. 

Compressed-domain  processing:  The  tasks  that  can  be  performed  on  images 
when  they  are  represented  in  the  compressed  domain  are  limited  for  traditional  image 
compression  techniques.  As  mentioned  previously,  there  has  been  some  work  done  in 
this  area  but  the  outlook  for  the  development  of  a  general  paradigm  does  not  look 
promising.  In  our  work  we  promote  compressed-domain  processing  not  only  to  achieve 
higher  compression  but  also  to  improve  application-specihc  processing  performance. 
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The  contribution  of  this  idea  is  highly  philosophical  in  that  a  true  compressor  should 
determine  redundancy  in  a  way  that  reflects  an  underlying  message.  That  message 
should  be  accessible  since  its  constituents  are  redundant  and  were  captured  during 
compression.  A  noise  source  (uncorrelated  white  noise)  does  not  carry  any  message, 
hence  it  cannot  be  compressed;  anything  that  can  be  compressed  should  carry  a 
message. 

Some  preliminary  work  has  been  done  on  extending  our  ideas  to  grayscale  docu¬ 
ment  image  compression.  Other  future  work  entails  the  extension  of  our  techniques  to 
other  media  types  in  order  to  arrive  at  a  hyperdocument  representation  suitable  for 
transmission  and  compressed-domain  processing.  Graphics  and  network  extensions 
are  not  pursued  here  but  appear  quite  promising. 
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Chapter  2 


Basic  compression  and  data  representation 


To  address  various  aspects  of  compression,  organization,  and  transmission  in  a  uni¬ 
form  way,  we  need  to  provide  a  data  representation  model.  Based  on  the  requirements 
outlined  in  Chapter  1,  we  have  chosen  to  use  a  model  based  on  a  pattern  matching  and 
substitution  algorithm,  but  we  need  to  analyze  the  available  information  to  arrive  at 
a  suitable  organization.  This  is  desirable  for  a  number  of  reasons.  One  reason  is  that 
it  facilitates  further  research  and  improvements,  but  more  immediate  consequences 
are  identihcation  of  key  algorithms,  their  interaction  in  a  unihed  representation,  and 
preservation  of  compressed-domain  analysis. 

Use  of  pattern  matching  and  substitution  for  compression  requires  several  levels  of 
information.  Since  there  exists  a  well-dehned  set  of  patterns  within  a  document  image 
we  would  like  to  organize  the  patterns  so  that  they  are  accessible  to  document  analysis 
tasks.  Once  an  organization  is  derived,  we  can  consider  what  are  the  constituent 
processes  that  yield  the  necessary  data.  The  most  important  aspect  of  the  data 
representation  is  its  use  for  subsequent  processing.  Several  levels  of  representation 
need  to  be  addressed  for  efficient  access.  Some  algorithms  require  knowledge  of  the 
positions  of  characters  (skew  estimation/correction)  while  other  require  knowledge 
of  their  shapes  (keyword  searching).  A  representation  thus  needs  to  separate  these 
types  of  information.  There  exist  several  levels  of  data  types,  and  their  successful 
integration  is  crucial  to  effective  application-level  processing. 

This  chapter  is  organized  as  follows.  We  will  discuss  our  general  approach  to 
compression  in  Section  2.1  and  our  representation  of  the  document  page  in  Section 
2.2  by  describing  its  organization  in  terms  of  data  streams.  We  discuss  in  Section  2.3 
the  use  of  these  streams  for  compression  and  decompression.  We  give  some  examples 
in  Section  2.4  followed  by  a  summary  in  Section  2.5. 

2.1  Approach 

The  hrst  step  in  our  symbolic  document  image  compression  method  is  to  hnd  an 
initial  set  of  patterns  in  the  image  which  can  be  used  to  form  a  library.  In  the  case  of 
Latin  text,  performing  a  connected  component  analysis  on  the  binary  image  provides 
a  reasonable  starting  set.  For  connected  scripts  or  text  in  which  the  basic  units  are 
disconnected,  more  extensive  segmentation  would  be  necessary. 

A  small  portion  of  a  typical  document  is  shown  in  Figure  2.1a,  and  the  bounding 
boxes  of  the  connected  components  are  shown  in  Figure  2.1b.  Because  these  patterns 
tend  to  appear  repeatedly  in  the  image,  they  form  a  basis  for  compression.  In  cases 
where  multiple  characters  touch  to  form  a  single  component,  or  where  a  single  char¬ 
acter  is  split  into  multiple  components,  representing  them  as  a  new  component  lowers 
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Figure  2.1:  a)  A  sample  document  image; 
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the  compression  factor  only  slightly.  If  salt  and  pepper  noise  is  present,  it  may  give 
rise  to  small  components;  this  will  reduce  compression  but  will  have  little  or  no  effect 
on  the  readability  of  the  document. 

We  next  treat  each  component  as  an  observation  and  try  to  determine  a  best 
set  of  classes  (clusters)  of  the  components  and  to  choose  a  prototype  image  for  each 
class.  After  all  observations  have  been  processed,  the  prototype  map  typically  looks 
like  the  one  shown  in  Figure  2.2.  The  shapes  shown  in  Figure  2.2  resemble  Fnglish 
characters  because  of  the  primarily  Fnglish  content  of  the  original  document.  For 
a  document  rich  in  mathematical  symbols,  some  of  the  prototypes  would  resemble 
mathematical  symbols,  and  similarly  for  non-Latin  languages  the  prototypes  would 
capture  their  symbol  content.  Clustering  algorithms  with  different  characteristics 
affect  the  eventual  symbol  representation  which  needs  special  attention.  A  number 
of  algorithms  are  considered  and  discussed  in  Chapter  3. 

Fach  cluster  of  components  is  represented  by  a  prototype.  Depending  on  the 
sample  space,  there  will  exist  some  amount  of  variability  in  the  clusters.  For  some 
clusters,  the  amount  of  this  variation  may  be  large  enough  that  some  of  the  compo¬ 
nents  differ  signihcantly  from  the  prototype  and  extra  information  must  be  recorded 
to  remedy  this.  A  residual  map,  the  difference  between  a  given  component  and  its 
prototype,  is  preserved  and  used  to  recover  the  components  in  a  lossless  form  when 
necessary.  Fxamples  of  a  component,  a  prototype,  and  the  corresponding  residual 
map  are  shown  in  Figure  2.3.  In  addition  to  coding  the  prototype,  we  code  the  resid¬ 
ual  map  separately  so  that  access  to  individual  symbols  can  be  achieved  by  access  to 
their  residual  maps.  The  overall  encoding  scheme  is  shown  in  Figure  2.4. 

In  our  work  special  attention  is  given  to  performing  document  analysis  tasks  in 
the  compressed  domain,  without  full  decompression.  In  many  situations  only  parts 
of  the  encoded  information  pertaining  to  the  given  task  require  decompression. 

Our  approach  makes  a  strong  case  for  the  use  of  symbolic  compression  in  document 
image  coding.  We  will  show  that  it  is  possible  to  code  the  dominant  symbol  shapes 
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Figure  2.2:  Typical  cluster  prototypes  from  a  textually  rich  image. 
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Figure  2.3:  a)  A  component;  b)  a  prototype;  c)  the  residual  map. 

and  their  locations  within  the  image  using  only  about  one  percent  of  the  original 
image  data.  Using  only  this  encoding  it  is  possible  to  implement  a  large  number  of 
common  document  analysis  tasks  such  as  skew  estimation/correction  [If,  13,  63,  73, 
84],  Optical  Character  Recognition  (OCR)  [71],  and  layout  analysis  [74,  80]. 

In  our  approach,  we  use  an  indexable  representation  [53]  composed  of  independent 
streams  for  the  prototypes,  the  locations  of  symbol  instances,  and  the  residual  maps. 
We  have  shown  that  lossy  compression,  progressive  transmission,  sub-document  re¬ 
trieval,  skew  estimation/correction,  and  keyword  searching  can  all  be  done  efficiently 
using  this  representation  [53]. 

Our  contribution  is  in  the  development  of  a  compression  system  that  promotes 
compressed-domain  analysis  by  allowing  symbol  access.  Although  the  approach  works 
best  with  clean  images  where  patterns  repeat,  it  is  flexible  enough  to  adapt  to  situa¬ 
tions  where  the  components  correspond  to  arbitrary  patterns  in  the  image  as  opposed 
to  symbols,  or  where  many  symbols  are  mis-clustered. 
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Figure  2.4:  Representation  of  a  compressed  document  image  with  relevant  processing 
access. 
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in  common.  Author  keywords  and  a 
function  called  KevWords  Plus  let  us- 
ers  retrieve  additional  information. 


(a) 


(b) 


Figure  2.5:  Example  of  ambiguous  memberships  for  (a)  a  section  of  an  image  and  (b) 
the  resulting  residual  components. 

2.2  Compressed  document  representation 

Prototypes  are  easily  calculated  by  the  formation  of  clusters  and  these  prototypes  are 
substituted  for  the  actual  components  by  indexing  their  locations.  The  specihcation 
of  component  location  is  done  in  two  parts.  For  coding  efficiency  the  page  is  divided 
into  blocks  of  256  x  256  pixels  each  of  which  can  be  addressed  by  a  single  byte.  A 
component  is  indexed  by  the  relative  location  of  the  upper  left  corner  of  its  bounding 
box  with  respect  to  the  upper  left  corner  of  the  block.  (The  number  of  components  in 
each  block  is  also  stored.)  These  locations  and  the  block’s  address  yield  the  absolute 
addresses  of  the  components.  This  method  of  addressing  is  used  in  order  to  provide 
direct  access  to  the  components.  Relative  position  specihcation  is  used  in  compression 
by  Witten  et  ah  [112],  by  specifying  offset  location  from  the  bottom  right  of  the 
previous  component  to  the  top  right  of  the  next  component.  Their  method  has 
smaller  storage  requirements  but  does  not  provide  direct  access  to  the  components. 

Once  we  have  obtained  clusters  and  chosen  prototypes  we  need  to  represent  the 
residuals  (the  differences  between  the  components  and  their  associated  prototypes). 
Regardless  of  the  clustering  algorithm,  we  have  observed  that  there  always  exist 
residuals  that  affect  the  renditions  of  the  original  components,  at  least  as  regards 
readability.  An  example  of  a  set  of  residuals  is  shown  in  Figure  2.5^.  A  residual  map 
is  associated  with  each  component  and  in  a  representation  based  on  symbol  access, 
the  residuals  are  indexed  based  on  their  associated  components.  A  small  region  of 
an  image  was  purposely  chosen  that  had  many  components  that  did  not  cluster  well. 
This  in  not  representative  of  common  residual  maps. 


^In  this  example,  we  see  three  types  of  residuals  -  some  that  do  not  effect  readability  (the 
silhouettes),  some  that  do  affect  readability  (the  letter  ‘K’  was  replaced  by  prototype  d’),  and  one 
that  did  not  cluster  and  relies  solely  on  the  residual  for  rendition  (the  letter  ‘W’). 
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General  Steams 


Figure  2.6:  Data  structure  organization. 

2.2.1  Page  representation 

In  order  to  provide  spatial  access  to  image  components  in  the  compressed  domain, 
it  is  convenient  to  encode  four  types  of  information  in  four  independent  sections: 
the  hie  header  section,  the  prototype  section,  the  symbolic  section  (see  below),  and 
the  residual  section.  The  hie  header  contains  general  information  which  provides 
indexes  to  the  prototype,  symbolic,  and  residual  sections  and  an  index  to  the  header 
section  of  the  subsequent  page  in  a  multi-paged  document.  (In  this  paper  we  have 
considered  only  single-page  documents.)  The  prototype  section  is  a  collection  of 
prototype  bitmaps  and  their  sizes.  The  symbolic  section  provides  an  encoding  of  the 
locations  of  components  in  the  image  and  the  prototype  of  which  each  component  is 
an  instance.  The  residual  section  contains  the  residuals  produced  after  substituting 
the  prototypes  for  the  actual  components.  Each  of  these  four  sections  is  encoded  as  a 
set  of  streams.  Figure  2.6  shows  the  organization  of  the  streams;  they  are  described 
in  greater  detail  in  the  following  subsections. 

2.2.2  Streams 

The  hie  header  section  contains  a  single  header  stream  and  serves  as  an  overall 
road  map  for  the  representation,  specifying  global  document  image  parameters  as 
well  as  indices  to  the  prototype,  symbolic,  and  residual  sections  and  the  streams 
they  contain.  The  prototype  section  encodes  the  library  of  prototypes  and  contains 
two  streams,  the  prototype  size  stream,  which  records  the  size  of  each  prototype 
bitmap,  and  the  prototype  image  stream,  which  encodes  the  actual  bitmaps.  Two 
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extra  NULL  prototypes  are  used  to  serve  as  indexes  for  small  and  large  graphic 
components  which  are  represented  in  the  residual  stream.  These  prototypes  are  used 
to  specify  appropriate  numbers  of  bytes  for  the  sizes  of  the  graphic  components  so 
they  are  indexable. 

The  symbolic  section  specihes  the  location  of  each  component  along  with  the 
prototype  which  represents  it.  To  reduce  storage  requirements  in  recording  the  com¬ 
ponent  locations,  the  image  is  divided  into  blocks  of  size  256x256  and  component 
locations  are  specihed  relative  to  each  block.  To  calculate  the  absolute  address  of  a 
component  during  decompression,  the  block  membership  stream  provides  a  block 
index  and  the  component  layout  stream  provides  the  vertical  and  horizontal  offsets 
of  the  components  within  the  block  along  with  their  prototype  labels.  The  compo¬ 
nent  layout  index  stream  is  used  to  index  the  component  layout  stream  after  it  is 
compressed. 

The  residual  section  stores  the  residual  for  each  component  and  is  indexed  in 
the  same  way  and  in  the  same  order  as  its  symbolic  counterpart.  This  requires  the 
residual  map  index  stream,  an  index  into  the  residual  map,  and  the  residual 
map  stream  itself. 

The  advantage  of  representing  different  types  of  information  by  different  streams 
and  then  indexing  the  streams  is  that  applications  which  need  access  only  to  limited 
information  do  not  need  to  decompress  the  entire  document.  For  tasks  such  as  skew 
estimation  and  correction,  which  may  require  only  the  locations  of  the  components,  we 
need  access  only  to  the  prototype  size  and  component  layout  streams.  For  a  task  such 
as  keyword  searching  or  OCR,  the  prototype  needs  to  be  accessed,  and  minimal  access 
to  the  residual  map  stream  may  also  be  needed.  To  extract  and  decode  subregions 
of  the  document  image,  we  can  access  the  needed  blocks,  and  we  can  then  access  the 
components  in  these  blocks  efficiently  using  the  index  structures. 

The  size  of  each  stream  is  specihed  in  a  stream  header.  The  header  remains 
uncompressed  so  that  access  to  each  stream  is  independent  of  access  to  the  other 
streams.  Fach  stream  can  be  compressed  individually  or  a  common  “dictionary” 
can  be  implemented  and  made  available  in  the  header.  In  our  implementation,  we 
index  into  the  component  layout  stream  by  referencing  block  boundaries  and  into  the 
residual  map  stream  by  referencing  individual  components. 

2.3  Compressing  the  streams 

ffuffman  coding  is  used  to  compress  the  streams;  it  requires 
streams  are  scanned  to  create  a  common  Huffman  lookup  table, 
streams  are  transformed  according  to  the  newly  created  table, 
the  compressed  streams  will  be  discussed  below. 

2.3.1  Header  section 

Header  stream:  This  stream  consists  of  indices  to  the  streams  contained  in  the 
prototype,  symbolic,  and  residual  sections.  The  indices  are  stored  as  the  sizes  of  the 
individual  streams;  they  remain  uncompressed. 


two  steps.  First  the 
Second  the  individual 
Creating  indexes  into 
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2.3.2  Prototype  section 

Prototype  size  stream:  Prototypes  are  limited  to  256x256  pixels  so  we  can  use 
single  bytes  for  their  horizontal  and  vertical  dimensions.  These  pairs  are  ordered 
as  a  stream  and  compressed.  The  compressed  size  is  recorded  in  the  header.  When 
required,  the  compressed  stream  can  be  decompressed  and  used  to  index  the  prototype 
image  stream. 

Prototype  image  stream:  The  prototype  image  stream  orders  the  pixels  of  each 
prototype,  either  by  row  or  column.  The  resulting  array  is  packed  into  the  prototype 
image  stream  and  compressed.  The  compressed  size  is  recorded  in  the  header.  For 
decompression,  the  compressed  stream  is  indexed  by  the  size  stream.  Knowing  the 
sizes  and  number  of  prototypes  from  the  prototype  size  stream,  and  the  size  of  the 
decompressed  prototype  image  stream,  the  stream  can  be  unpacked  into  the  set  of 
two-dimensional  prototypes. 

2.3.3  Symbolic  section 

Block  membership  stream:  We  have  divided  the  document  image  into  non¬ 
overlapping  blocks  of  size  256x256.  To  index  the  blocks,  we  record  the  hrst  row 
and  column  of  the  block  in  two  bytes.  We  create  a  stream  for  the  number  of  compo¬ 
nents  in  each  block,  compress  the  stream  and  record  the  size  of  the  compressed  stream 
in  the  header.  During  decompression  we  extract  the  stream  and  use  it  to  index  the 
blocks. 

Component  layout  stream:  For  each  component  in  a  block,  we  append  to  a  list  its 
position  in  the  block,  by  specifying  a  horizontal  and  vertical  offset,  and  the  cluster  to 
which  it  belongs,  by  specifying  the  identihcation  of  its  prototype.  If  the  component 
is  associated  with  a  NULL  prototype,  we  also  record  its  size  (the  height  and  width 
of  its  bounding  box).  For  each  block,  we  compress  the  list  and  record  its  compressed 
size  in  the  header.  We  add  this  size  to  the  component  layout  index  stream  (described 
below)  in  one  byte  and  start  a  new  record  for  the  next  block.  When  all  components 
in  all  blocks  are  exhausted,  we  record  the  overall  stream  size.  For  full  decompression, 
we  ignore  the  index  stream  and  decompress  the  entire  component  layout  stream.  For 
partial  decompression  we  use  the  size  information  in  the  index  stream  and  decompress 
only  the  needed  part  of  the  component  layout  stream. 

Component  layout  index  stream:  To  index  the  component  layout  stream  we 
form  a  stream  based  on  the  compressed  sizes  of  the  component  layout  records.  We 
compress  the  entire  stream  and  record  the  resulting  size  in  the  header.  When  the 
information  in  the  stream  is  needed,  we  decompress  the  entire  stream. 

2.3.4  Residual  section 

Residual  map  stream:  Using  the  same  encoding  as  for  the  component  layout 
stream,  we  order  the  residual  maps  into  a  stream.  For  each  residual  map,  we  hrst 
pack  the  pixels  into  bytes,  submit  that  portion  of  the  stream  for  compression,  record 
the  compressed  size  in  one  byte,  and  record  the  ending  bit  offset  in  one  byte.  We 
do  this  for  all  residual  images  and  record  the  overall  size  of  the  compressed  residual 
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map  stream  in  the  header.  For  full  decompression,  we  extract  the  entire  residual  map 
stream  based  on  its  compressed  size  and  unpack  it.  For  partial  decompression,  we 
use  the  decompressed  residual  map  index  stream  to  extract  the  needed  segments  from 
the  residual  map  stream. 

Residual  map  index  stream:  For  the  index,  we  form  a  stream  based  on  the 
compressed  size  of  the  list  of  residual  map  streams  for  each  component.  We  compress 
the  entire  stream  and  record  the  resulting  size  in  the  header.  When  the  information 
in  the  stream  is  needed,  we  decompress  the  entire  stream. 

2.4  Results 

To  test  our  system  we  used  text  regions  from  the  hrst  f  22  scanned  images  (A00fBlN.TIF 
to  A006N.TIF)  in  the  UWASH  1  database  which  is  available  from  the  University  of 
Washington  [32].  For  synthetic  images  we  used  UTgX-generated  hies  from  the  same 
database  (L000SYN.TIF  -  L00MSYN.TIF).  Our  basic  compression  package  converts 
binary  TIFF  images  to  and  from  symbolically  compressed  images.  For  synthetic, 
UTgX-generated  images,  it  took  a  total  of  855  seconds  to  compress  23  hies  for  an 
average  of  37  seconds.  The  average  resulting  hie  size  was  45312  bytes  compared  to 
57848  bytes  in  CClTT  G4  coding  and  82112  bytes  in  CClTT  G3.  For  comparison, 
the  same  images  were  compressed  using  packed  bits  and  LZW  compression,  yielding 
average  hie  sizes  of  198,015  bytes  and  110,810  bytes  respectively.  With  original  im¬ 
ages  of  2550  by  3300  pixels,  the  compression  ratios  are  summarized  in  Table  2.1.  For 
the  case  of  scanned  images,  compression  time  was  an  average  of  66  seconds  for  the 
122  hies  of  Table  2.1.  (The  time  is  for  an  implementation  running  on  a  SparcStation 
20  with  the  Solaris  2.5  operating  system.) 


Image 

Symbolic 

G4 

G3 

JBIG 

MG 

LZW 

PACK-BITS 

Synthetic 

23.2 

18.2 

12.8 

26.9 

39 

9.5 

5.3 

Scanned 

12.2 

17.8 

9.7 

22.3 

27 

8.7 

5.5 

Table  2.1:  Average  compression  ratios  for  synthetic  and  scanned  images  for  differ¬ 
ent  compression  schemes.  The  images  are  taken  from  the  University  of  Washington 
Database  [32].  Averages  were  taken  over  20  synthetic  images  and  100  scanned  images. 


2.5  Summary 

We  have  proposed  a  data  organization  that  is  designed  to  work  with  symbolic  com¬ 
pression.  Given  that  we  can  identify  sources  of  redundancy  and  useful  sources  of 
information  we  designed  the  organizational  model  to  address  both.  In  our  methodol¬ 
ogy  we  identihed  the  source  of  redundancy  as  the  repetition  of  components  found  in 
document  images.  The  source  of  useful  information  is  found  to  also  be  in  the  compo¬ 
nents,  but  more  specihcally  it  resides  in  the  locations  and  shapes  of  the  components. 
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Redundancy  is  captured  by  determining  representative  classes  of  observed  compo¬ 
nents,  which  is  easily  done  by  clustering.  These  clusters  are  represented  by  prototypes 
and  are  indicative  of  the  components’  shapes.  Therefore,  prototypes  need  to  be  pre¬ 
served  and  stored  to  be  accessed  individually,  as  we  have  described.  While  the  shapes 
of  components  are  captured  in  prototypes,  their  locations  need  to  be  stored  indepen¬ 
dently  to  allow  for  dissemination  of  useful  information.  This  was  done  by  providing 
for  a  symbolic  layout  section.  The  remaining  information  relates  to  the  residual  cod¬ 
ing.  Redundancy  removal  within  the  residual  section  will  be  addressed  in  Chapter  5. 
Some  useful  information  for  this  purpose  can  be  obtained  from  the  prototype  image; 
to  make  this  information  available  we  must  index  the  residuals  based  on  the  compo¬ 
nents.  We  proposed  a  residual  coding  method  which  is  well  suited  for  partial  retrieval 
and  compressed-domain  processing. 

ffaving  provided  a  specihcation  of  the  data  organization,  we  can  now  proceed  with 
specihcation  of  actual  algorithms. 
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Chapter  3 


Clustering 


3.1  Introduction 

A  large  number  of  clustering  algorithms  have  been  described  in  the  literature  [4,  86]. 
Some  algorithms,  despite  their  wide  usage  for  other  purposes,  are  not  applicable 
to  symbol  clustering.  The  goal  of  symbol  clustering  is  to  identify  the  intra-class 
similarities  of  the  symbols.  Due  to  size  and  pattern  variations  in  the  components, 
some  clustering  algorithms  may  not  provide  a  satisfactory  result,  as  mentioned  below 
for  two  of  the  algorithms. 

One  example  of  an  inappropriate  clustering  algorithm  is  principal  component  anal¬ 
ysis,  where  image  rows  or  columns  are  concatenated  into  a  vector  and  for  a  given  set 
of  symbol  images  a  covariance  matrix  is  computed.  The  eigenvalues  of  this  matrix 
give  the  variances  along  the  eigenvector  directions.  The  effectiveness  of  this  method 
depends  on  the  rows  and  columns  providing  consistent  information  for  all  of  the  im¬ 
ages.  This  will  not  be  the  case  if  the  images  are  of  different  sizes.  To  remedy  this  one 
can  pad  the  observed  image  to  a  hxed  size  with  a  constant  value  or  scale  the  input 
image  to  a  hxed  size.  The  padding  presents  a  problem  when  the  input  images  are 
artihcially  weighted  to  the  size  of  the  padded  region.  In  the  example  shown  in  Figure 
3.1,  it  can  be  seen  that  all  observations  have  similar  shapes  (a  circular  shape  will  over¬ 
lap  all  the  observations  to  some  degree)  in  the  upper  portion  of  the  image,  and  that 
the  lower  portions  of  observations  6  and  7  look  the  most  dissimilar  (one  descender  is 
on  the  left  and  one  is  on  the  right  with  no  overlap).  However,  since  observations  1  to 
5  are  padded,  the  weight  of  the  descenders  in  observations  6  and  7  is  2/7  of  what  it 
should  be  (not  as  discriminating  as  it  should  be).  Normalizing  the  components  is  not 
a  viable  option,  since  prototype  generation  will  not  be  possible  in  a  cluster  where  the 
sizes  of  the  components  are  different.  Scaling  of  images  to  a  normalized  size  creates 
a  situation  in  which  an  appropriate  prototype  may  become  impossible  to  hnd.  The 
fact  of  the  matter  is  that  the  size  information  is  an  effective  discriminator  and  it  can 
be  used  to  effectively  separate  observations  {6,7}  and  {t,2,3,4,5}  of  Figure  3.t. 

A  second  type  of  clustering  algorithm,  vector  quantization  [30],  is  also  affected  by 
the  variability  in  observed  sizes.  In  vector  quantization,  each  symbol  image  is  hrst 
ordered  as  a  vector  and  this  vector  is  considered  to  be  a  point  in  a  high-dimensional 
space.  A  small  set  of  “prototype”  vectors  is  arbitrarily  chosen  and  the  distances  to 
all  the  prototypes  from  all  of  the  observations  are  calculated.  We  associate  with  each 
observation  the  prototype  closest  to  it.  The  prototypes  are  then  rehned  by  taking  the 
means  of  their  associated  observations,  and  the  process  is  repeated.  The  prototypes 
and  the  distance  measure  divide  the  high-dimensional  space  into  regions  of  coarser 
granularity;  hence  the  name  quantization.  The  results  obtained  by  this  method  are 
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Padded  Regions 


Figure  3.1:  Example  of  padding. 


(a) 


(b) 


(c) 


Figure  3.2:  Vector  quantization  regions  in  a  two-dimensional  sample  space  with  a) 
equal-population  clusters;  b)  a  3:t  population  difference;  c)  a  5:t  population  differ¬ 
ence. 


population-dependent;  more  prototypes  will  be  found  in  regions  of  the  space  where 
symbol  images  occur.  Figure  3.2  shows  how  the  regions  change  when  the  number  of 
samples  in  one  of  the  clusters  is  increased. 

A  third  class  of  clustering  methods,  which  we  will  primarily  use  here,  are  pattern 
matching  and  substitution  approaches.  If  a  candidate  pattern  is  a  good  match  to  an 
existing  prototype,  it  is  classihed  as  a  member  of  that  prototype’s  class;  otherwise,  it 
is  considered  for  possible  creation  of  a  new  class.  The  advantages  of  using  this  type 
of  method  are  that  we  do  not  need  a  priori  knowledge  about  the  number  of  classes 
and  that  the  method  works  with  a  variety  of  image  sizes. 

We  can  describe  clustering  via  pattern  matching  more  formally  by  hrst  dehning 
notation  for  an  observation  X  and  a  prototype  P: 


X 


X(n,  m)  for  (n,  m)  e  Tlx  =  ({1,  ■■■,  {1,  ■■■,  Mx}) 

0  otherwise 


(3.t) 


P 


P(n,  m)  for  (n,  m)  G  =  ({t, ...,  tVp},  {t, ...,  Mp]) 
0  otherwise 


(3.2) 


where  it  is  assumed  that  foreground  pixels  have  value  1  and  background  pixels  (and 
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pixels  outside  the  bounds  of  TZx  and  TZp)  have  value  0.  A  number  of  pattern  matching 
methods  which  can  be  used  to  dehne  measures  of  closeness  between  an  observation 
and  a  prototype  are  discussed  below. 

The  rest  of  this  chapter  deals  with  matching  methods  that  can  be  used  in  our  pat¬ 
tern  matching  and  substitution  clustering  process.  We  hrst  discuss  a  simple  Hamming 
distance  in  Section  3.2.  We  extend  this  to  a  weighted  Hamming  distance  in  Section 
3.3.  A  different  approach  is  then  suggested  to  take  into  account  background  and 
foreground  errors  in  Section  3.4.  In  Section  3.5  compression-based  (entropy-based) 
matching  techniques  are  discussed.  In  Section  3.6  we  propose  a  new  matching  tech¬ 
nique  based  on  the  distance  transform  which  has  the  desired  property  of  weighting 
structural  components  appropriately.  We  conclude  with  some  remarks  in  Section  3.7. 


3.2  Hamming  distance 


The  simplest  method  of  matching  two  binary  images  is  to  measure  their  dissimilarity 
by  the  number  of  pixels  that  are  not  equal.  An  error  map  calculated  from  the  exclusive 
OR  (XOR)  of  the  observed  image  and  the  prototype  is  given  by 


E(n^  m) 


X{n,  m)  0  P(n,  m)  for  (n,  m)  G  TZx  H  Tip 

X(n,  m)  for  (n,  m)  G  Tlx  —  Tip 

P{n,  m)  for  (n,  m)  G  Tip  —  Tlx 

0  otherwise 


(3.3) 


where  £'(n,  m)  is  dehned  for  all  (n,  m)  G  Tie  =  Tlx  U  Tip.  Since  the  XOR  operation 
returns  a  value  of  f  for  a  mismatch,  summing  the  error  map  provides  a  measure  of 
mismatch, 

M  =  \Tle\-  M  =  Y.  (3-4) 

(n,m)G7?.e 

where  \Tle  \  is  the  highest  mismatch  score  and  M  is  the  actual  match.  In  this  formula¬ 
tion,  maximizing  the  match  is  the  same  as  minimizing  the  mismatch  of  (3.4).  Figure 

3.3  shows  three  examples  involving  an  intra-class  observation,  a  possibly  confusing 
inter-class  observation,  and  an  obvious  inter-class  observation,  along  with  their  mis¬ 
match  scores.  The  amount  of  mismatch  shows  some  degree  of  discrimination  among 
those  cases,  and  by  use  of  a  threshold,  we  are  able  to  identify  some  classes.  However, 
if  the  threshold  is  too  high  some  intra-class  confusion  may  arise,  such  as  clustering 
‘e’  shapes  with  ‘c’  shapes.  Taking  a  low  threshold  would  result  in  too  many  clusters 
and  would  hinder  compression  efforts,  ft  is  desirable  to  use  a  distance  measure  that 
provides  enough  separation  for  dissimilar  shapes. 


3.3  Weighted  Hamming  distance 

Improving  the  distance  measure  to  discriminate  between  similar  images  (‘e’  and  ‘c’ 
of  Figure  3.3)  is  desirable;  the  weighted  Hamming  distance  [85,  112]  provides  such 
an  improvement.  This  distance  measure  gives  greater  importance  to  error  pixels 
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Figure  3.3:  Matching  results  for  three  examples  of  observed  components.  Observation 
A  is  similar  to  the  prototype,  observation  B  is  close,  and  observation  C  is  very  differ¬ 
ent.  XOR-ed,  Weighted  XORed,  and  Distance-based  differences  are  shown,  as  well 
as  the  pixels  contributing  to  the  mismatch,  and  the  mismatch  value  in  parentheses. 
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which  appear  in  close  proximity  to  other  error  pixels.  Error  pixels  which  appear  close 
together  tend  to  correspond  to  strnctnrally  meaningfnl  featnres. 

The  weighted  Hamming  distance  operates  on  the  error  map  of  (3.3)  by  snmming 
over  a  neighborhood  of  each  error  pixel: 

=  E{n,m)  X  ^  E{k,l)  (3-5) 


where  Af{n,  m)  is  the  3x3  neighborhood  of  the  (n,  m)th  pixel.  With  this  weighting 
strategy,  error  pixels  that  occnr  in  a  gronp  will  give  a  higher  mismatch  than  iso¬ 
lated  error  pixels.  Fignre  3.3  also  shows  the  resnlts  of  nsing  the  weighted  Hamming 
distance  measnre  for  the  same  observations  and  prototype.  Note  that  the  small  dif¬ 
ference  between  observations  A  and  B  is  now  mnch  larger  when  the  mismatch  score 
is  calcnlated  by  (3.4)  and  nsing  the  weighted  error  map  E^j. 

3.4  Sum  of  weighted  AND-NOTs 

When  we  nse  an  XOR  operation  the  sonrce  of  the  errors  is  not  considered.  In  par- 
ticnlar  no  distinction  is  made  between  errors  in  foregronnd  pixels  and  errors  in  back- 
gronnd  pixels,  ft  may  be  desirable  to  give  more  importance  to  the  foregronnd  pixels 
since  most  of  the  information  is  contained  in  them.  This  can  be  done  by  nsing  the 
AND-NOT  measnre.  The  weighted  AND-NOT  map  is  dehned  by 

{XAP){n,m)x  {X  AP){k,l) 

(XAP){n,m)x  (XAP){k,l) 

This  map  is  nsefnl  in  cases  where  the  weighting  has  elevated  the  mismatch  level  dne 
to  misalignment,  as  illnstrated  in  Fignre  3.4. 

3.5  Compression-based  template  matching 

Compression-based  template  matching  [43]  is  a  method  of  matching  which  attempts 
to  measnre  the  mntnal  information  between  the  prototypes  and  the  observations. 
This  is  can  be  done  by  minimizing  the  entropy  of  the  residnal.  The  entropy  of  a 
binary  signal  E  =  {e*,  ii  =  1, ...,  \'R-eW  of  distribntion  P(e)  is 

ff(£)  =  -X:F(e)lo&(P(e))  (3,7) 

e=0 

P(e)  is  estimated  given  E.  Eqnating  Pk{E)  to  a  mismatch  valne  favors  conditions 
where  the  complexity  of  E  is  low  which  translates  into  cases  of  mostly  zeros  or  mostly 
ones.  This  is  rather  different  from  previons  approaches  in  which  the  nnmber  of  pixels 
in  the  residnal  map  is  minimized.  In  compression-based  template  matching,  the  idea 
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Figure  3.4:  Example  of  an  observed  component  compared  to  a  prototype  by  measuring 
mismatch  as  weighted  XOR  and  as  weighted  AND-NOT. 

is  to  keep  the  residual  pixels  non-random  with  respect  to  each  other  and  achieve  a 
lower  entropy  for  the  residual  map.  An  immediate  enhancement  is  to  replace  the 
probability  with  one  that  is  conditioned  on  the  formed  clusters.  Mutual  information 
is  then  measured  by 

I[E\k)  =  -  \og^  P{E{n,m)\k)  (3.8) 

where  k  is  the  cluster  index.  While  this  probability  can  be  estimated  from  the  mem¬ 
bers  assigned  to  the  cluster,  it  can  also  be  estimated  as  a  function  of  the  prototype 
image  (convolution  with  some  hlter  [116]). 

3.6  Distance-based  template  matching 

Our  hnal  method  of  matching  attempts  to  weight  each  error  pixel  according  to  the 
probability  that  it  was  corrupted  by  degradation  or  noise.  In  this  way  we  give  lower 
weights  to  error  pixels  which  are  likely  to  have  resulted  from  noise  and  higher  weights 
to  others. 

In  studies  of  document  image  degradation  at  the  character  level  it  was  observed 
that  pixel  errors  occur  more  often  close  to  the  edges  of  characters  [48].  The  probability 
of  a  pixel  changing  its  value  decays  as  a  squared  exponential  with  the  distance  to  an 
edge.  Specihcally  [48],  the  probability  of  a  pixel  changing  value  from  background  to 
foreground  at  distance  d  from  an  edge  is  given  by 

P(l|d,  a,  &)  =  1  —  P(0|d,  a,  h)  = 
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(3.9) 


Binary  Entropy  Plot 


Figure  3.5:  Plot  of  entropy  for  a  binary  signal  as  a  function  of  sample  probability 

where  a  is  the  background  decay  rate  and  b  denotes  the  fact  that  the  pixel  in  question 
really  belongs  to  the  background.  Similarly,  for  foreground  pixels,  the  probability  of 
changing  value  is 

Pm,l3J)  =  1-P{l\d,l3,f)  =  (3.10) 

We  can  use  this  fact  to  conclude  that  observed  differences  which  occur  between  a 
prototype  and  a  candidate  pattern  away  from  the  edge  of  the  prototype  are  more 
likely  to  be  the  result  of  meaningful  structure  and  not  noise  and  should  therefore  be 
considered  with  higher  priority  during  matching. 

We  can  use  this  degradation  model  as  a  basis  for  a  matching  model  which  assigns 
weights  based  on  relative  distances  from  the  prototype  edge.  To  simplify  the  model, 
we  assume  that  a  is  equal  to  (3  and  calculate  the  weight  matrix  as  a  distance  trans¬ 
form.  This  greatly  reduces  the  computational  requirement  by  avoiding  a  scan  to  all 
components  to  determine  a  and  (3  and  by  removing  an  extra  exponential  operation. 
We  also  use  only  a  relative  distance  measure;  this  allows  us  to  avoid  the  computation 
of  two  squares  and  a  square  root. 

To  compute  the  distance  transform,  consider  a  prototype  P .  Let  T  be  the  set  of 
indices  for  the  foreground  pixels  of  the  prototype,  and  P>  the  set  of  indices  for  the 
background  pixels.  Knowing  that  T  f]  B  =  0,  the  square-of-the-distance  transform, 
or  distance  map,  of  prototype  P  is  calculated  by 

Dp[n,  m)  =  min{  (n  —  +  (m  —  /)^  | 

(fc,  /)  e  JF  ,  {n,m)  ^  B  or 

{k,l)eB,  {n,m)e^}  (3-11) 


This  measures  the  closest  (squared)  distance  of  an  unlike  pixel  to  the  current  pixel. 
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d  d' 

(a)  (b) 

Figure  3.6:  Example  of  a)  an  observation,  b)  a  prototype,  c)  a  residual  map,  and  d) 
the  distance  map  of  the  prototype. 

The  study  of  distance  transforms  dates  back  to  the  late  t960s.  A  classical  reference  by 
Rosenfeld  [87]  discusses  a  number  of  algorithms  for  its  computation,  while  for  images 
represented  by  quadtrees  one  can  use  a  method  suggested  by  Samet  [90].  Other  recent 
algorithms  to  compute  distance  transforms  can  be  found  in  [t6,  33,  60,  t04,  t07]. 

Figure  3.6  shows  a  typical  prototype  and  a  map  of  the  distances  to  the  edges. 
(Figure  3.3  showed  the  distances  overlayed  on  the  prototype  image.)  Summing  the 
error  pixels  weighted  by  their  distance  values  provides  a  better  separation  for  inter¬ 
class  observations.  For  the  intra-class  observations  in  Figure  3.4,  it  is  easy  to  see 
that  the  distance-weighted  mismatch  is  26  since  all  error  pixels  occur  on  the  edges 
and  there  are  only  26  such  pixels.  Distance  weighting  separates  the  closest  inter-class 
observations  shown  in  Figure  3.3  by  ft  and  44. 

It  should  be  pointed  out  that  distance-based  template  matching  could  easily  be 
adapted  to  other  clustering  techniques,  at  the  cost  of  more  computation.  An  example 
is  adaptation  to  vector  quantization.  Consider  the  clusterings  of  Figure  3.2,  and 
suppose  that  for  high  values  of  x  and  y  we  would  like  to  weight  the  x  values  more 
heavily,  and  for  low  values  we  would  like  to  weight  the  y  values  more  heavily.  This 
would  shift  the  region  boundaries  of  Figure  3.2a  in  such  a  way  that  at  the  top  right  we 
would  have  vertical  thin  regions  and  at  the  lower  left  we  would  have  horizontal  thin 
regions.  The  result  would  be  a  partition  that  is  dependent  on  location  in  the  space 
as  well  as  on  the  population  of  observations  in  that  area  of  the  space.  The  distance 
ordering  has  a  number  of  desirable  properties,  and  provides  a  good  framework  for 
document  image  processing  and  handling. 

3.7  Summary 

In  this  chapter  we  considered  a  number  of  clustering  algorithms  that  might  be  used 
in  our  compression  system.  Due  to  the  size  variability  of  the  input  images  and  the 
unknown  number  of  clusters,  we  choose  a  match-and-classify  approach.  This  ap¬ 
proach  allows  for  the  use  of  different  matching  functions  so  that  specihc  features  of 
observations  can  be  exploited.  Prototype  generation  is  easy  once  a  set  of  clusters  is 
determined.  We  simply  average  the  cluster  members  to  arrive  at  their  representative 
prototypes. 

The  simplest  match  criterion  tested  was  the  sum  of  the  Hamming  distances,  which 
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was  immediately  improved  to  a  weighted  sum  of  Hamming  distances.  While  Hamming 
distance  sums  the  mismatch  valne  as  the  nnmber  of  pixels  that  were  nneqnal,  weighted 
Hamming  distance  pnts  more  weight  on  gronpings  of  pixels  which  were  nneqnal.  This 
provided  a  reasonable  improvement.  We  also  considered  an  alternate  distance,  namely 
AND-NOT  distance.  The  reason  for  considering  AND-NOT  is  that  for  the  XOR 
fnnction,  the  ontpnt  is  independent  of  the  fact  that  foregronnd  and  backgronnd  pixels 
of  either  the  prototype  or  the  component  may  not  match.  There  may  be  sitnations 
where  a  slight  misalignment  contribntes  a  large  amonnt  toward  the  mismatch  valne. 
By  nsing  the  AND-NOT  fnnction,  the  gronpings  of  nneqnal  pixels  are  weighted  based 
on  their  foregronnd  and  backgronnd  memberships. 

Another  class  of  matching  fnnctions  tested  was  entropy-based.  This  class  main¬ 
tains  a  low  level  of  complexity  for  the  residnals;  however,  it  is  not  gnaranteed  that 
the  prototypes  resemble  meaningfnl  components.  The  matching  measnre  is  the  com¬ 
plexity  of  the  residnal  image.  A  simple  connt  of  residnal  pixels  yields  a  probability 
mass  distribntion  that  is  nsed  to  calcnlate  an  entropy  measnre.  This  measnre  was 
adapted  to  inclnde  a  condition  that  the  entropy  of  the  residnals  belonging  to  a  class 
be  minimized.  Since  all  pixels  are  not  created  eqnal,  especially  for  docnment  images, 
the  entropy  contribntions  of  all  pixels  cannot  be  nniformly  combined;  they  need  to 
be  weighted  based  on  their  potential  contribntion  toward  appropriate  prototype  gen¬ 
eration.  Entropy-based  template  matching  may  contribnte  towards  the  compression 
of  images,  bnt  it  is  more  desirable  to  create  prototypes  which  contribnte  toward  cor¬ 
rect  recognition.  Once  a  clnstering  algorithm  can  gnarantee  correct  assignment  of 
components  to  clnsters  (i.e.  all  clnsters  consist  of  components  which  wonld  be  recog¬ 
nized  as  the  same  symbol  as  the  prototype  by  a  hnman  reader),  we  can  then  employ 
compression-based  clnstering  to  fnrther  minimize  residnal  complexity  while  keeping 
recognition  intact.  Snch  performance  has  never  been  achieved;  most  researchers  in 
this  area  conhrm  that  correct  prototype  representation  is  possible  only  for  docnment 
images  that  are  of  good  qnality. 

A  distance-based  approach  is  snggested  here  to  address  some  concerns  regarding 
the  pixels’  contribntions.  A  model  based  on  character  degradation  is  nsed  for  this 
pnrpose.  This  model  snggests  that  pixels  close  to  edges  of  components  are  most  likely 
to  change  in  valne  dne  to  degradation  effects.  We  extend  onr  model  to  hypothesize 
that  pixels  far  from  edges  mnst  contain  large  amonnts  of  recognition  information. 
This  is  snpported  by  the  fact  that  most  docnment  images  that  are  degraded  are  still 
readable.  Therefore  we  associate  a  distance  measnre  with  each  pixel  and  weight  its 
difference  from  the  prototype  based  on  that  distance.  This  allows  discrimination 
of  shapes  that  are  fairly  similar  bnt  have  strongly  dissimilar  strnctnral  components. 
This  favors  the  clnstering  of  similar  strnctnrally  signihcant  patterns.  While  we  still 
cannot  gnarantee  that  clnster  members  will  all  be  recognized  as  the  same  symbol,  we 
have  contribnted  toward  achieving  a  single  recognizable  component  per  clnster. 

There  exist  several  clnstering  algorithms  which  vary  in  performance,  bnt  there 
exists  no  clnstering  algorithm  that  performs  perfectly.  Therefore,  it  is  essential  to 
inclnde  at  least  partial  residnal  information  to  assnre  a  representation  that  allows  for 
correct  recognition  of  characters.  Distance-based  matching  provides  good  clnstering  of 
components  based  on  their  strnctnral  featnres  and  readability,  bnt  it  does  not  operate 
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perfectly,  so  residual  coding  is  needed  to  address  concerns  beyond  clustering.  This 
will  be  covered  in  Chapter  5.  Another  concern  is  the  quality  of  the  input  components. 
In  Chapter  4  we  revisit  component  segmentation  to  see  if  the  input  components  are 
well  suited  for  our  clustering  methods. 


3t 


Chapter  4 


Segmentation-based  clustering 


4.1  Introduction 

Enhancements  to  clnstering,  snch  as  the  work  done  by  Bloomberg  and  Vincent  [14] 
and  Zhang  and  Danskin  [116],  can  greatly  improve  performance  both  in  lossless  and 
lossy  compression.  While  there  have  been  efforts  to  improve  clnstering  in  the  symbolic 
compression  of  docnment  images,  it  is  nnrealistic  to  expect  clnstering  alone  to  address 
all  aspects  of  prototype  generation.  For  characters  that  are  connected  or  broken  in 
the  docnment  image,  a  segmentation  based  on  connected  components  will  not  yield 
isolated  characters  after  clnstering,  no  matter  how  good  the  clnstering  is.  In  this 
case  the  resnlting  set  of  prototype  image  maps  contains  a  certain  level  of  rednndancy 
which  was  not  adeqnately  captnred  by  the  segmentation  and  clnstering  algorithms. 
If  a  clnstering  algorithm  does  not  reqnire  the  ontpnt  of  the  segmentation  rontine,  it 
is  considered  to  be  a  joint  segmenter  and  clnsterer  and  does  not  fall  in  the  category  of 
simple  clnstering  algorithms.  An  ideal  clnstering  rontine  wonld  interact  intelligently 
with  the  segmentation  rontine  and  derive  the  best  possible  prototype  images  while  the 
segmentation  rontine  provides  a  better  segmentation  given  the  nnderlying  prototype 
maps.  Eventnally,  in  residnal  coding,  which  involves  a  majority  of  the  data,  the  data 
can  be  efficiently  organized  and  coded,  as  has  been  demonstrated  by  Howard  [37]  and 
as  shown  in  Chapter  5. 

For  the  general  symbolic  compression  problem,  segmentation  has  not  been  a  major 
issne.  This  is  becanse  any  component  regardless  of  shape  and  size  will  be  coded 
appropriately,  althongh  not  as  efficiently  as  it  conld  be.  With  the  introdnction  of 
small  amonnts  of  degradation  in  the  original  image,  the  strnctnral  integrity  of  the 
character  shapes  degrades  and  the  components  either  start  tonching  or  breaking  apart. 
It  is  therefore  necessary  to  review  alternative  methods  of  segmentation  to  improve 
compression  performance  for  snch  docnments.  An  example  of  snch  a  docnment  is 
shown  in  Fignre  4.f. 

Initially  we  nsed  a  connected  component  algorithm  for  segmentation.  Althongh 
snch  segmentation  can  be  done  qnickly,  the  process  is  sensitive  to  connected  or  bro¬ 
ken  characters.  While  most  of  the  spatial  featnres  of  connected  characters  are  still 
preserved,  their  segmentation  remains  a  non-trivial  task.  In  symbolic  compression, 
fnndamental  character  shapes  are  identified  and  clnstered  to  generate  prototypes  that 
embody  the  constitnent  patterns  of  the  image.  In  cases  of  tonching  characters,  clnsters 
will  be  created  for  observed  instances  and  constitnent  shapes  will  not  be  identified. 
We  propose  to  nse  the  prototypes  obtained  in  the  first  pass  to  perform  partial  match¬ 
ing  and  an  alternative  segmentation  on  those  components  which  have  not  clnstered 
well.  The  ontpnt  of  the  segmentation  is  fed  into  the  clnstering  rontine  again  for  an 
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A  legitimate  concern  of  some  resellers,  however, 
permitted  to  sell  their  existing  installed  base.  For  exj 
has  been  granted  rights  to  market  certain  software  as  ] 
the  VAR  may  want  the  right  to  provide  that  software  £ 
customers  who  do  not  initially  acquire  the  software  wit! 
decide  to  add  that  functionality  later.  Software  supplii 
limit  a  VAR’s  rights  in  this  regard  by,  for  example. 


Figure  4.1:  Example  of  a  document  image  containing  a  large  set  of  connected  char¬ 
acters. 

enhanced  set  of  representative  prototypes.  This  process  can  be  repeated  to  improve 
the  compaction  of  the  prototype  library. 

To  improve  clustering,  compression,  and  processing  of  document  images  we  con¬ 
sider  an  integrated  segmentation  and  recognition  approach  where  the  initial  segmen¬ 
tation  and  identihcation  of  clusters  identihes  stable  prototypes.  We  then  iterate  a 
process  to  rehne  both  segmentation  and  clustering  of  components.  The  prototypes 
from  the  original  clustering  are  used  to  re-segment  the  previous  components  and  gen¬ 
erate  a  new  set.  The  new  set  of  components  is  clustered  to  form  new  prototypes,  and 
so  on.  This  process  is  performed  iteratively  with  varying  degrees  of  segmentation  to 
determine  the  set  of  prototypes  for  the  image  that  yield  the  best  compression. 

This  chapter  is  organized  as  follows.  In  Section  4.2  we  describe  our  general  ap¬ 
proach  toward  a  joint  segmentation  and  clustering  methodology.  In  Section  4.3  we 
describe  how  subcomponent  matching  is  performed  to  arrive  at  possible  segmentation 
alternatives  and  in  Section  4.4  we  describe  the  segmentation  method  itself.  In  Section 
4.5  we  study  the  integrated  process  of  segmentation  and  clustering  which  provides  a 
basis  for  enhanced  segmentation.  In  Section  4.6  we  present  the  results  and  in  Section 
4.7  we  summarize  and  make  comments  about  the  algorithm,  including  its  limitations 
and  future  enhancements. 

4.2  Approach 

To  provide  integrated  segmentation  and  clustering,  we  take  the  connected  components 
of  the  document  image  as  the  hrst  guess  at  the  components,  cluster  them,  and  gen¬ 
erate  a  prototype  set  as  shown  in  Figure  4.2.  Using  these  prototypes,  we  attempt  to 
resegment  larger  components  by  considering  all  possible  segmentations  of  the  compo¬ 
nents,  matching  and  attempting  to  identity  sub-instances  of  the  previous  prototypes 
(Figure  4.3).  The  entire  list  of  components  is  visited  and  all  combinations  of  vertical 
cuts  are  considered  in  order  to  determine  the  best  possible  match.  The  list  of  match 
values  associated  with  each  segmentation  is  recorded.  During  segmentation,  the  best 
match  of  a  sub-component  to  a  prototype  is  chosen  and  the  associated  cut-point  is 
used  to  create  a  new  observation.  After  considering  all  components  for  segmentation. 
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Figure  4.2:  Example  of  a  prototype  set  that  is  rich  in  touching  components. 

a  new  component  list  is  created  which  reflects  the  new  instances.  These  components 
are  then  used  for  clustering  and  to  create  a  new  set  of  prototypes.  The  process  of 
segmentation  and  clustering  is  repeated  several  times  to  allow  for  segmentation  of 
large  components. 

4.3  Subcomponent  matching 

For  components  that  consist  of  touching  characters,  we  assume  that  prototypes  resem¬ 
bling  the  characters  which  they  contain  are  present  and  that  we  need  to  identify  them 
and  locate  their  positions.  To  identify  these  sub-instances,  it  is  necessary  to  match 
smaller  components  with  parts  of  larger  components.  The  two  factors  in  finding  this 
match  are  shape  and  position.  A  number  of  other  factors  need  to  be  discussed;  how¬ 
ever,  the  general  methodology  is  to  obtain  the  largest-size  set  of  prototypes  that  are 
independent.  This  is  required  since  blind  decomposition  of  patterns  into  their  small¬ 
est  constituents  results  in  a  set  of  the  smallest  prototypes,  such  as  periods.  Figure 

4.4  shows  several  prototypes  from  the  library  in  Figure  4.2.  It  is  clear  that  the  “ted” 
cluster,  having  a  number  of  observed  instances  assigned  to  it,  is  composed  of  touching 
components,  but  how  to  define  the  set  of  its  sub-components  is  unclear.  It  is  clear 
that  decomposition  of  the  “e”  prototype  (Figure  4.4d)  is  desired;  however,  a  better 
decomposition  is  that  of  the  “ed”  prototype  (Figure  4.4e),  since  it  is  a  constituent 
but  one  that  is  bigger  in  size  than  the  “e”  prototype.  The  reason  for  this  is  that  if 
most  instances  of  “e”  appear  as  “ed”,  then  representing  “ed”  is  more  efficient  than 
representing  “e”.  Since  this  is  not  the  case  for  the  most  part,  we  can  conclude  that 
the  “ted”  pattern  should  first  be  decomposed  into  “t”  and  “ed”  and  then  “ed”  should 
be  decomposed  into  “e”  and  “d”.  The  components  shown  in  Figure  4.4c  and  f  are 
also  constituents  but  this  level  of  decomposition  is  clearly  undesirable. 

Since  we  are  dealing  with  rectangular  bounding  boxes  to  segment  regions  of  com- 
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Figure  4.3:  Location  of  prototype  sub-instances  in  connected  components. 

ted  ted  r 

(a)  (b)  (c) 

0  0d 

(d)  (e)  (t) 


Figure  4.4:  Set  of  observations  considered  for  segmentation  (a-c)  and  corresponding 
proposed  constituents  (d-f). 
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ponents  and  the  text  lines  are  horizontal,  we  consider  only  vertical  cnts.  The  al¬ 
gorithm  to  generate  segmentation  hypotheses  is  as  follows.  For  each  vertical  cnt, 
we  determine  the  minimum  bonnding  boxes  of  the  two  resnlting  components.  The 
resnlting  components  are  then  matched  to  all  prototypes  which  they  may  contain. 
The  mismatch  is  measnred  by  the  snm  of  nneqnal  pixels  weighted  by  the  distance 
transform.  Given  the  distance  transform  of  (3.ff), 

Dp{n,  m)  =  min{  (n  —  k)^  +  (m  —  l)^  | 

(k,  /)  G  JF  ,  {n,m)  E  B  or 

{k,l)eB,  {n,m)  e  (4.f) 

the  mismatch  M  is  calcnlated  by 

M  =  ^  E(n,  m)  x  Dp{n,  m)  (4-2) 

n,m 

where  E(n^  m)  is  the  error  between  the  segmented  component  X(n,  m)  and  prototype 
P(n,  m)  (3.3), 


E{n^  m) 


{{X  ©  P)(n,  m) 
X(n,  m) 

P{n,  m) 

0 


(n,  m)  G  TZx  H  TZp 
(n,  m)  G  TZ-x  —  P-p 
(n,  m)  G  Tip  —  Tlx 
otherwise 


(4.3) 


A  match  is  identihed  if  the  minimnm  mismatch  is  lower  than  a  threshold.  Fignre  4.5 
shows  the  best  segmentation,  where  the  segmented  portion  matched  to  an  existing 
prototype.  In  snch  a  scenario,  if  one  segmented  component  shows  a  good  match,  the 
cnt  is  kept  regardless  of  the  match  between  the  other  component  and  any  existing 
prototype.  This  is  necessary  in  order  to  decompose  larger  components  which  have 
more  than  two  constitnents.  For  components  having  three  or  more  constitnent  ele¬ 
ments,  decomposition  can  take  place  from  the  right  or  left  and  can  proceed  to  the 
middle. 

To  ensnre  proper  segmentation  and  avoid  over-segmentation  as  was  shown  in  Fig¬ 
nre  4.4c  and  f,  several  factors  are  considered.  The  hrst  factor  is  the  position  of  the 
segmentation  cnt.  ft  is  nndesirable  to  segment  a  component  at  its  extreme  right  or 
left  edges.  If  cnts  at  these  locations  are  picked,  they  are  most  likely  dne  to  small 
prototypes  which  do  not  resnlt  in  an  adeqnate  decomposition.  The  distance  of  the 
cnt  from  the  middle  of  the  segmented  component  takes  this  factor  into  acconnt. 

The  second  factor  is  favors  components  that  are  tonching  along  small  nnmbers 
of  pixels.  The  nnmber  of  foregronnd  pixels  that  coincide  with  the  segmentation  cnt 
takes  this  factor  into  acconnt. 

The  remaining  factors  are  based  on  size,  ft  is  more  desirable  to  decompose  larger 
components,  so  we  take  the  size  of  the  component  considered  for  resegmentation  into 
acconnt.  ft  is  more  desirable  to  decompose  a  component  into  the  biggest  prototypes 
that  are  smaller  (at  least  in  area)  than  the  given  component,  and  so  the  area  of  the 
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Observation  Prototype 


Distance  Transform  of  Prototype 


Figure  4.5:  Best  proposed  segmentation  cut  for  an  observation  and  a  corresponding 
template  that  was  matched  based  on  the  distance  transform. 


prototype  considered  for  best  segmentation  is  taken  into  account.  Figure  4.6  shows 
examples  of  these  factors.  In  terms  of  weights,  we  use  a  mismatch  value  of 

—  —  W  eight  I  X  W  eiqht2 

M^=M  X  \ 

Weights  X  W eight4^ 

Upon  completion  of  component  matching,  the  best  cut,  the  mismatch  value,  and 
the  closest  matched  prototype  are  kept.  The  segmentation  process  then  uses  these 
values  to  determine  the  best  segmentation. 

4.4  Segmentation 

For  segmentation,  we  need  to  consider  the  cuts  considered  in  partial  matching  for 
cases  of  touching  characters.  Fach  component  is  visited  and  from  among  all  possible 
cuts,  only  the  cut  that  yields  the  best  partial  match  is  considered.  Note  that  only  one 
cut  is  allowed  per  visit,  greatly  reducing  the  complexity  of  segmentation  versus  the 
more  general  two-cut  scheme.  Using  two  cuts,  a  number  of  fail-safe  mechanisms  need 
to  be  considered  to  avoid  empty  objects  and  residual  objects  created  by  taking  a  cut 
that  is  not  exactly  on  the  edge  of  a  neighboring  constituent  component.  The  best 
match  is  then  thresholded  to  either  keep  the  cut  or  discard  it.  Various  thresholding 
constants  were  used  with  varying  amounts  of  effect  on  compression.  Fxamples  of 
components  that  were  segmented  in  this  way  are  shown  in  Figure  4.7. 
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(b)  (d) 


Figure  4.6:  Mismatch  weighting  by  considering  (a)  the  cut  position,  (b)  overlap  with 
foreground  pixels,  (c)  component  size,  and  (d)  prototype  size. 


th  ting  ed  cal 

(a)  (b)  (c)  (d  ) 

Figure  4.7:  Fxamples  of  segmented  components. 
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Figure  4.8:  Prototypes  of  document  image  associated  with  Figure  4.2  after  integrated 
segmentation  and  clustering. 

4.5  Integrated  segmentation  and  clustering 

After  the  original  segmentation  (connected  component  analysis)  and  clustering,  fur¬ 
ther  segmentation  is  performed  by  specifying  a  threshold  on  the  partial  matches  to  the 
components.  All  initial  components  are  considered  for  segmentation.  If  the  mismatch 
value  is  below  the  threshold,  the  initial  component  is  divided  into  two  parts.  After 
all  components  have  been  considered  for  segmentation  and  divided  as  necessary,  the 
clustering  algorithm  is  applied  to  the  new  set  of  observations.  This  process  is  iterated 
repeatedly  to  allow  for  sufficient  decomposition  of  large  connected  components.  The 
prototype  library  originally  shown  in  Figure  4.2  looks  like  the  one  shown  in  Figure 
4.8  after  processing. 

Integrated  segmentation  and  clustering  is  also  applicable  to  the  case  of  discon¬ 
nected  characters.  A  reverse  approach  needs  to  be  taken  in  which  combining  com¬ 
ponents  is  done  instead  of  dividing  them.  A  nearest  neighbor  approach  is  used  to 
combine  components,  and  the  results  are  matched  to  the  prototype  library.  The  clos¬ 
est  match  is  kept  if  its  mismatch  value  is  below  a  threshold.  The  same  methodology 
is  used  as  in  the  matching  of  partial  components  to  prototypes  except  that  the  cost 
factor  associated  with  a  cut  is  replaced  by  one  that  is  associated  with  component 
combination.  Prototype  instances  that  are  larger  than  the  broken  component  (prior 
to  joining)  are  considered  and  the  smallest  such  prototype  is  favored.  This  alternative 
method  of  resegmentation  is  also  benehcial. 

4.6  Results 

The  algorithm  can  be  evaluated  in  terms  of  its  performance  in  prototype  generation, 
component  decomposition,  effectiveness  of  prototype  library  generation,  and  residual 
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Compressed 

Size 

Library 

Size 

Pointer 

Size 

Residnal 

Size 

File 

(I) 

(F) 

(I) 

(F) 

(I) 

(F) 

(I) 

(F) 

Sf 

44260 

41552 

4550 

2642 

3158 

3948 

34038 

31931 

S2 

39828 

38240 

5600 

5160 

2258 

2443 

20088 

28648 

S3 

27300 

26041 

3563 

2703 

1579 

1921 

20763 

19833 

S4 

30397 

28926 

4640 

2481 

1787 

2069 

22377 

22633 

S5 

27014 

26416 

2705 

1971 

2818 

3198 

19318 

18834 

S6 

25050 

24489 

3347 

1748 

2254 

2950 

17648 

17537 

S7 

34818 

33941 

3724 

3448 

3207 

3393 

25537 

24648 

S8 

40100 

38207 

3544 

3252 

3856 

3984 

29811 

27985 

S9 

51713 

49778 

4520 

4116 

3927 

4340 

40280 

38084 

SA 

23978 

22987 

3331 

2812 

1715 

2035 

17459 

16490 

Total 

390678 

376527 

44630 

34896 

30468 

34551 

281618 

280648 

%  A 

(4)% 

(23)% 

14% 

(.3)% 

Table  4.1:  Statistics  of  selected  files  with  initial  (1)  and  final  (F)  compression  char¬ 
acteristics  after  applying  integrated  segmentation  and  clnstering  algorithm. 


image  generation.  While  the  overall  gain  in  compression  is  important,  it  is  also 
important  to  see  how  this  gain  is  achieved. 

Table  4.1  shows  statistics  for  a  selected  gronp  of  docnment  images.  For  debngging 
pnrposes  we  considered  small  samples  of  docnment  images  which  had  large  nnmbers 
of  connected  components.  Images  that  did  not  have  high  nnmbers  of  connected 
components  showed  only  a  small  amonnt  of  improvement.  For  the  examples  in  Table 
4.1  we  performed  symbolic  compression  with  and  withont  segmentation  to  stndy 
the  amonnt  of  improvement.  We  nsed  a  representation  that  stored  the  prototype 
library,  symbol  location,  and  residnal  information  separately,  so  that  the  effects  of 
segmentation  conld  be  stndied  across  all  aspects  of  the  symbolic  compression. 

The  overall  compression  factor  was  observed  to  have  decreased  by  4.5%.  However, 
this  was  dne  to  the  small  image  sizes;  the  clnstering  process  was  not  able  to  nse  the  rich 
popnlation  of  observed  components  available  in  larger  images.  The  compression  factor 
is  expected  to  increase  for  large  images  where  there  exist  large  nnmbers  of  components 
as  well  as  instances  of  tonching  characters  that  span  a  large  nnmber  of  characters. 
The  interesting  resnlt  is  in  the  rednction  in  the  size  of  the  prototype  library.  Since 
the  prototype  library  is  a  small  part  of  the  representation,  it  does  not  affect  overall 
performance  bnt  it  is  the  major  portion  of  the  lossy  compression  representation.  For 
lossy  compression  consisting  of  only  the  prototype  images,  we  need  to  code  66083  bits 
when  we  do  not  nse  segmentation,  bnt  only  60614  bits  when  we  do  nse  it,  resnlting  in 
an  8%  increase  in  compression.  The  effects  of  segmentation  on  processing  are  clear, 
in  that  the  prototypes  are  more  representative  of  the  characters  and  provide  a  better 
symbolic  representation.  Basically,  any  resnlt  that  is  obtained  by  compressed-domain 
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processing  is  more  dependable  with  segmentation  than  withont. 


4.7  Summary 

It  is  logical  to  consider  alternative  methods  of  segmentation  in  addition  to  the  ba¬ 
sic  method  of  connected  components.  In  most  cases,  docnment  images  are  of  high 
qnality  and  the  connectedness  of  characters  is  preserved.  This  cannot  be  gnaran- 
teed,  however,  and  it  is  desirable  to  devise  a  method  that  can  gracefnlly  deal  with 
degradation.  Becanse  proportional  fonts  are  often  nsed  in  typesetting,  and  becanse  of 
the  point  spread  fnnctions  of  scanners,  we  observe  a  larger  redaction  in  performance 
dne  to  tonching  components  than  dne  to  broken  components.  In  this  chapter  we 
have  described  a  method  that  can  segment  connected  components  nsing  the  shapes 
of  dominant  components.  This  method  has  the  desirable  characteristic  that  a  stable 
level  of  representation  is  achieved  in  component  shape  and  size.  It  also  improves 
compression  by  4%.  It  is  important  to  note  that  this  method  is  independent  of  the 
clnstering  algorithm  [116]  and  data  representation  [53,  112]  nsed,  and  that  it  can 
beneht  from,  and  contribnte  to,  their  operation.  For  example,  a  better  clnstering 
algorithm  that  can  achieve  a  higher  level  of  separation  can  provide  better  prototypes 
to  be  nsed  in  partial  matching.  Better  segmentation  resnlts  can  increase  the  per¬ 
formance  of  compressed-domain  processing  in  tasks  snch  as  key-word  searching  and 
skew  detection. 

The  method  presented  here  has  not  been  optimized  in  any  way,  bnt  it  performs 
compression  in  a  few  minutes  on  a  SparcStation  20.  The  images  nsed  for  testing  it 
were  derived  from  the  University  of  Washington  Database  CD  ROM  [32],  by  cropping 
images  to  contain  rich  text  content,  withont  scanning  and  copying  artifacts. 
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Chapter  5 


Residual  coding 


The  primary  contribution  of  this  chapter  is  in  the  development  of  new  methods  of 
residual  coding.  These  methods  provide  a  hierarchical  representation  and  allow  for 
compression  and  compressed-domain  processing,  with  natural  extensions  to  lossy  com¬ 
pression  and  progressive  transmission. 

5.1  Introduction 

In  the  previous  chapters,  we  considered  a  number  of  methods  of  determining  suitable 
prototypes.  Since  the  components  that  appear  in  document  images  contain  several 
levels  of  ambiguity,  it  becomes  unrealistic  to  pursue  a  conservative  approach  towards 
prototype  generation.  The  advantage  of  a  prototype  generating  process  that  guaran¬ 
tees  a  high  level  of  correct  recognition  of  components  is  that  it  results  in  a  large  set  of 
prototypes.  The  factors  that  affect  the  choice  of  an  ensemble  of  prototypes  are  image 
degradation,  shape  redundancy /variability,  cluster  membership,  and  an  acceptable 
level  of  residual  complexity. 

Considerable  complexity  exists  in  the  residual  images.  It  is  desirable  to  code  the 
residuals  so  as  to  achieve  as  high  a  level  of  compression  as  possible  and  provide  a 
hierarchical  representation  that  degrades  gracefully  and  provides  correct  recognition 
of  components.  Much  like  the  preservation  of  structural  components  during  the  pro¬ 
totype  generation  phase,  it  is  possible  to  structurally  order  the  residuals  of  the  image 
components  and  achieve  a  graceful  degradation  of  their  structural  components  [52]. 
We  pursue  three  methods  of  coding  based  on  structure.  We  hrst  consider  a  distance- 
based  coding  method  that  preserves  structural  components  when  residuals  are  coded, 
and  we  also  consider  two  other  methods  that  attempt  to  capture  the  structural  com¬ 
ponents  when  residual  pixels  are  coded. 

The  remainder  of  this  chapter  is  organized  as  follows.  In  Section  5.2  we  discuss  the 
basic  distance-ordered  code.  In  Section  5.3  we  extend  the  basic  distance  ordering  to 
structural  ordering.  In  Section  5.4  we  analyze  the  code  in  terms  of  its  compressibility 
by  performing  an  entropy  analysis.  In  Section  5.5  we  present  the  results  of  the 
analysis.  In  Section  5.6  we  conclude  with  some  hnal  remarks. 

5.2  Distance-ordered  residual  coding 

The  nature  of  the  residuals  is  clearly  dependent  on  the  quality  of  the  original  image, 
i.e.  how  well  symbols  can  be  clustered,  and  on  the  quality  of  the  clustering  algorithm 
itself.  In  clean  document  images,  components  are,  for  the  most  part,  isolated  from 
each  other  and  they  can  easily  be  segmented  and  clustered  to  form  well-dehned  groups. 
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When  the  components  and  their  prototypes  are  very  similar,  little  contribntion  from 
the  residnal  maps  is  necessary  to  accnrately  encode  the  symbols.  As  the  images 
degrade,  however,  noise  is  introdnced  and  characters  tend  to  either  merge  to  form 
bigger  components  or  break  into  mnltiple  pieces.  This  increase  in  the  variability  of 
the  components  in  the  clnsters  resnlts  in  an  increase  in  the  complexity  of  the  residnal 
map. 

An  issne  independent  of  how  the  residnals  may  change  is  how  well  the  clnstering 
algorithm  performs  in  the  presence  of  increased  noise.  A  clnstering  algorithm  needs  to 
captnre  meaningfnl  similarities  between  degraded  symbols.  For  document  images,  the 
strnctnral  aspects  of  symbols  are  the  most  meaningfnl  when  the  ability  to  recognize 
or  read  the  symbols  is  considered.  Whatever  levels  of  image  qnality  and  clnstering 
qnality  are  achieved,  it  is  desirable  to  code  the  residnals  so  that  a  compact  and  nsable 
representation  is  achieved. 

Docnment  image  degradation  research  snggests  that  the  edge  pixels  of  an  ideal 
image  are  most  snsceptible  to  noise  degradation.  Nagy  et  ah  [72]  and  Sarkar  [91] 
nsed  a  point  spread  fnnction  to  determine  the  effects  of  degradation  on  characters 
in  processes  like  OCR.  Kannngo  et  ah  [48]  nsed  a  qnadratic  exponentially  decaying 
fnnction  to  model  the  probabilities  that  foregronnd  and  backgronnd  pixels  are  affected 
by  noise.  Extrapolating  their  hndings,  it  can  be  hypothesized  that  1)  pixels  close  to 
edges  are  more  likely  to  be  affected,  and  pixels  at  the  same  distance  from  edges 
have  similar  noise  probabilities;  2)  pixels  far  from  edges  contribnte  the  most  to  the 
strnctnre  (and  recognition)  of  the  components.  In  snpport  of  the  second  hypothesis, 
it  can  be  observed  that  most  docnment  images  are  still  readable  after  introdncing  a 
large  amonnt  of  degradation  at  or  near  edges. 

The  above  degradation  formnlation  can  be  nsed  to  create  a  model  that  attempts 
to  distingnish  noise  pixels  from  informative  strnctnre-contribnting  pixels  dnring  clns¬ 
tering  and  coding.  The  formnlation  predicts  that  pixels  farther  from  an  edge  of  a 
component  contribnte  more  to  the  readability  of  the  component  than  do  close  ones. 
We  therefore  order  the  residnal  map  by  distance  so  that  the  pixels  which  contribnte 
the  least  to  recognition  will  be  coded  last.  This  is  implemented  by  taking  the  distance 
transforms  of  the  prototypes  and  ordering  the  residnals  by  distance,  from  farthest  to 
closest.  Consider  the  dehnitions  of  observation  X,  prototype  P,  error  map  P,  and 
distance  transform  given  in  (3.f)-(3.3)  and  (3.ff).  The  contents  of  E  are  ordered 
based  on  the  distance  transform  of  the  prototype  Dp[n,m). 

A  typical  prototype,  clnster  member,  residnal  map,  and  distance  transform  map 
are  shown  in  Fignre  5.f .  The  residnal  in  this  example  conforms  well  to  the  model  since 
the  prototype  and  observation  are  of  the  same  class  and  most  of  the  residnal  pixels  lie 
close  to  the  edges  of  the  prototype,  ft  is  clear  that  for  the  examples  in  Fignre  5.1  the 
residnal  pixels  are  primarily  noise  since  the  prototype  and  the  clnster  member  both 
resemble  the  character  ‘d’.  However,  if  an  ‘o’  were  clnstered  with  the  ‘d’,  the  residnal 
content  wonld  contribnte  signihcantly  to  the  ability  to  recognize  it  correctly.  Given 
these  observations,  a  distance  ordering  is  necessary  to  exploit  the  residnal  statistical 
distribntion  and  allow  residnal  pixels  with  the  smallest  contribntions  to  recognition 
to  be  coded  last. 

An  example  comparing  colnmn-major  and  distance  ordering  is  shown  in  Fignre 
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Figure  5.1:  Example  of  a)  prototype,  b)  cluster  member,  c)  residual  map,  and  d) 
distance  transform  of  prototype  image. 


(b) 


Figure  5.2:  Coding  of  residual  in  Figure  5.1  for  a)  column-major  ordering  and  b) 
distance  ordering. 

5.2.  The  column-major  order  results  in  residual  components  being  spread  throughout 
the  code.  In  the  distance  order  the  relative  probability  of  a  residue  increases  as  the 
code  progresses  (from  left  to  right,  provided  the  clustering  is  correct),  as  can  be  seen 
in  Figure  5.2b.  The  ordering  results  in  entropy  reduction  during  compression  and 
allows  us  to  hrst  transmit  pixels  which  contribute  the  most  to  symbol  recognition. 

The  quality  of  the  clustering  algorithm  affects  the  residuals’  complexity;  there 
may  exist  components  that  have  been  matched  to  prototypes  of  different  shapes. 
Figure  5.3  shows  an  example  of  this  situation  where  a  ‘w’  was  classihed  as  a  ‘v’.  The 
distance-ordered  code  is  complex.  Figure  5.4  shows  the  progressive  reconstruction  of 
this  component  by  using  fractional  portions  of  the  code.  It  can  be  seen  that  from  the 
start  of  the  code,  structural  components  that  differentiate  a  ‘v’  from  a  ‘w’  are  being 
formed,  aiding  in  the  correct  recognition  of  the  component.  It  can  also  be  seen  that 
this  residual  map  does  not  fall  into  the  same  category  as  the  residual  map  shown  in 
Figure  5.1.  From  a  degradation  point  of  view,  there  exists  a  large  amount  of  noise  far 
from  edge  pixels  and  the  distribution  does  not  ht  the  model.  Similar  results  would 
be  obtained  whenever  a  ‘w’  is  misclassihed  as  a  ‘v’. 
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Figure  5.3:  Example  of  inter-class  observation:  a)  cluster  prototype,  b)  observation, 
c)  residual  map,  and  d)  distance-ordered  residual  stream. 


V/V/W 

(a)  (b)  (c) 

Figure  5.4:  Structural  contribution  of  the  residual  stream  shown  at  reconstruction  of 
a)  30%  b)  60%  and  c)  90%  of  the  residual  stream. 
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Figure  5.5:  a)  Low,  b)  medium,  and  c)  high  structural  impact. 

5.3  Structural  coding 

In  the  previous  section  we  hypothesized  that  residual  pixels  that  are  farthest  from  the 
prototype’s  edge  contribute  the  most  to  the  correct  recognition  of  the  component.  In 
coding  the  residual,  we  ordered  the  pixels  by  their  distances.  We  can  do  better  if  we 
take  into  account  the  structure  of  the  residual,  as  discussed  in  this  section. 

5.3.1  Continuous  ordering 

The  structure  of  a  residual  involves  more  than  its  distance  ordering.  Structure  is 
a  hierarchical  concept;  a  group  of  pixels  has  structure,  while  a  single  pixel  can  only 
contribute  to  structure.  We  will  not  attempt  to  give  a  general  dehnition  of  “structure” 
here,  but  we  will  suggest  some  principles  which  are  applicable  to  the  character  domain. 
To  illustrate  these  principles,  consider  the  three  examples  in  Figure  5.5.  All  three 
cases  show  about  the  same  number  of  residual  pixels,  and  include  pixels  that  have 
large  distances  from  the  component.  In  5.5a,  the  pixels  are  randomly  scattered;  in 
5.5b  they  are  tightly  clustered  to  form  an  almost  connected  group;  in  5.5c  they  form 
a  group  which  is  almost  connected  to  the  prototype.  They  clearly  exhibit  a  varying 
degree  of  structural  impact. 

We  have  experimented  with  a  simple  residual  encoding  algorithm  based  on  the 
fact  that  structurally  signihcant  residual  pixels  are  often  connected  to  the  prototype. 
Our  algorithm  is  based  on  distance-ordered  transmission.  When  we  transmit  the  hrst 
residual  pixel  (farthest  from  the  prototype),  we  connect  it  (by  a  digitized  straight 
line  segment(s))  to  the  pixel(s)  of  the  prototype  closest  to  it,  yielding  an  augmented 
prototype  that  now  includes  the  original  residual  and  the  line  segment (s).  The  resid¬ 
ual  too  then  changes;  f’s  that  belong  to  the  augmented  prototype  now  need  not  be 
transmitted,  but  O’s  that  belong  to  it  (i.e.  pixels  on  the  line  segment  that  have  value 
0  in  the  image)  are  now  residual  pixels  and  need  to  be  transmitted.  As  we  continue  to 
transmit  residual  pixels  in  (the  original)  distance  order,  we  connect  them  to  the  (orig¬ 
inal)  prototype  with  line  segments  consisting  of  f’s  or  O’s;  when  the  latter  occurs,  it 
may  cause  f’s  to  change  back  to  O’s.  Note  that  such  changes  affect  only  pixels  that  are 
closer  to  the  prototype;  thus  the  distances  to  residual  pixels  never  increase,  and  the 
process  eventually  terminates  when  the  closest  residual  pixels  have  been  transmitted. 
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Figure  5.6:  Three  examples  of  structural  coding  using  various  percentages  of  the 
maximal  distance. 

The  results  of  using  this  simple  algorithm  show  that  (at  least)  the  widths  of  struc¬ 
tural  elements  are  captured  correctly.  Figure  5.6  shows  three  examples  of  components 
represented  by  using  a  percentage  of  the  maximal  distance.  The  hrst  example  is  an 
‘m’  component  that  was  misclassihed  as  an  ‘n’  prototype.  The  original  residual  map 
consisted  of  a  structural  component  that  was  connected  at  the  top  of  the  prototype. 
The  representation  of  this  component  using  residual  pixels  having  distances  within 
25%  of  the  maximal  distance  shows  a  large  number  of  errors  in  structure  estima¬ 
tion.  However,  subsequent  transmissions  correct  the  estimation  and  produce  a  shape 
close  to  that  of  the  component.  The  second  example,  an  ‘r’  misclassihed  as  a  dotless 
‘i’,  benehts  a  great  deal  from  the  structural  ordering;  from  the  onset  of  coding,  the 
represented  shape  is  very  close  to  the  original  component.  The  third  example,  a  ‘y’ 
misclassihed  as  a  ‘v’,  does  not  take  as  great  advantage  of  the  structural  ordering. 

5.3.2  Packet-mode  ordering 

An  enhanced  variation  of  this  approach  transmits  the  residual  pixels  in  distance- 
ordered  packets  (of  user-dehned  size).  When  the  hrst  packet  is  transmitted,  the 
closest  pixels  in  each  connected  component  of  the  packet  are  joined  to  the  prototype 
by  straight  line  segments.  This  dehnes  an  augmented  prototype  and  a  new  residual; 
the  next  packet  can  then  be  transmitted.  Fxamples  of  coding  based  on  packet-mode 
ordering  are  shown  in  Figure  5.7  for  packet  sizes  of  tOO  and  60  pixels.  The  size  of 
the  packet  should  be  based  on  the  size  of  the  component  and  should  be  large  enough 
that  a  packet  can  contain  residuals  having  a  range  of  distances,  and  small  enough  to 
allow  a  number  of  structural  coding  steps  to  take  place. 

The  ethciency  of  the  coding  using  packet  mode  ordering,  in  terms  of  both  com¬ 
pression  and  representation,  falls  between  those  of  continuous  ordering  and  distance- 
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Figure  5.7:  Three  examples  of  packet-mode  structural  coding  using  various  percent¬ 
ages  of  the  maximal  distance,  and  using  tOO-pixel  (middle  row)  and  60-pixel  (top  and 
bottom  rows)  packets. 


ordered  transmission,  since  a  packet  of  size  t  implies  continuous  ordering,  and  a  packet 
size  as  big  as  the  prototype  implies  distance  ordering.  For  intermediate  values,  our 
results  show  that  both  entropy  and  rendition  quality  are  more  favorable,  especially 
for  large  components. 

5.4  Analysis 

It  is  desirable  to  analyze  the  residual  code  to  determine  the  expected  amount  of 
compression  that  may  be  achievable  by  these  and  similar  coding  methods.  This  can 
easily  be  done  by  either  measuring  the  entropy  as  the  alphabet  approaches  inhnity, 
or  by  using  probabilistic  measures.  In  this  section  we  will  pursue  both  approaches  by 
setting  up  a  simple  model  and  then  extending  it  to  our  coding  method. 

5.4.1  Entropy 

Distance  ordering  has  a  number  of  desirable  properties,  with  the  compactness  of  the 
code  being  the  most  signihcant.  The  characteristics  and  complexity  of  the  residual 
map  are  directly  related  to  the  clustering  algorithm.  For  a  clustering  algorithm  that 
is  optimized  based  on  the  degradation  effects,  the  residual  pixels  can  be  characterized 
as  having  a  distribution  consistent  with  the  degradation  model.  For  unoptimized 
clustering  algorithms  residual  maps  might  contain  signihcant  structural  components. 
Assuming  that  the  clustering  algorithm  is  consistent  in  making  cluster  assignments, 
residual  maps  that  contain  a  signihcant  amount  of  structural  information  will  have 
consistent  mappings.  In  general,  a  clustering  algorithm  that  can  group  similar  shapes 
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Uniformly  Random  { 1 ,2, 3 ,4 }  Ordered  Code  O 

Figure  5.8:  System  realization  of  unordered  and  ordered  binary  code  obtained  from 
four  independent  and  statistically  different  sources. 

will  be  able  to  capitalize  on  the  consistent  distribution  of  residual  pixels,  particularly 
for  residual  maps  that  have  signihcant  structural  components. 

Up  to  now,  the  consistent  distribution  of  the  residual  pixels  has  been  shown  only 
by  observing  the  resulting  code  and  verifying  that  the  beginning  of  the  code  has 
consistently  contained  zero  values  in  components  that  are  clustered  well.  We  can, 
however,  do  better.  Assume  that  our  residual  pixels  are  samples  of  binary  sources 
with  varying  distributions  that  depend  on  the  distance  model  (i.e.,  residual  pixels 
having  distance  d  from  an  edge  are  samples  of  the  same  binary  source).  Since  com¬ 
ponents  belonging  to  the  same  cluster  are  not  necessarily  of  the  same  size,  the  row 
or  column  ordering  of  the  residual  pixels  can  be  regarded  as  a  random  selection  from 
the  binary  sources.  The  distance  ordering  is  an  attempt  to  improve  this  random  se¬ 
lection.  A  hypothetical  scenario  with  only  four  binary  sources  is  shown  in  Figure  5.8. 
In  reality,  the  number  of  sources  is  unknown  and  the  randomizer  R  is  non-uniform 
and  dependent  on  the  prototypes.  Let  the  unordered  code  be  denoted  by  U  and  the 
ordered  code  by  O. 

To  quantify  the  compactness  of  O  over  U  we  need  to  determine  their  entropies. 
The  entropy  of  a  discrete  binary  signal  B  can  be  calculated  by 

HeiB]  =  Y. -PbW  H2{PB{i))  (5.1) 

bets 

where  B  is  the  binary  space  {0, 1},  and  PB{b)  is  the  distribution  of  the  samples  based 
on  the  binary  signal  B.  This  measure  of  entropy  is  valid  for  a  binary  signal  that 
is  stationary  and  independent.  For  signals  involving  dependency  or  non- stationary 
processes,  this  measure  does  not  show  the  true  entropy.  Consider  the  entropy  of  a 
discrete  signal  X  with  samples  in  A, 

HAX)  =  [log,  (^) 
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(.5.2) 


For  X  =  {0, 1}^  we  achieve  the  hrst-order  entropy  which  is  identical  to  the  entropy  of 
(5.1),  and  for  X  =  {0, 1}”  we  achieve  the  nth-order  entropy,  In  other  words, 

a  signal  X  which  consists  of  samples  belonging  to  an  alphabet  of  size  |{0, 1}”|  =  2” 
needs  at  least  Hn{X)  bits  per  sample  (in  alphabet)  to  be  coded  without  any  loss. 
Then  for  the  n-bit  alphabet,  the  compression  factor  is  ^  (the  inverse  of  the 
entropy  rate).  For  the  stationary  and  independent  random  variable,  the  entropy  rate 
stays  constant,  but  for  our  ordered  binary  source  the  entropy  rate  decreases  as  the 
order  increases.  For  these  signals  the  true  entropy  is 

H{X)  =  lim  -HJX)  (5.3) 

n^oo  77, 

5.4.2  Simulation 

Computing  the  entropy  for  U  and  O  is  not  easy  since  in  calculation  of  Hn{X)  we 
need  to  estimate  the  distribution  of  symbols  in  an  alphabet  of  size  of  2”,  which  is  very 
large.  For  example,  for  n  =  15  the  alphabet  size  grows  to  30,000  symbols,  and  a  good 
estimate  of  the  probability  of  each  symbol  requires  a  sample  of  size  at  least  3  million. 
For  small  values  of  n,  we  simulate  the  entropy  of  U  and  O  as  shown  in  Figure  5.8, 
using  four  binary  sources  with  the  probability  assignments  of  {ihi,  5'2,  ihs,  5'4}  shown 
in  Table  5.1.  We  can  then  calculate  the  entropy  rates  shown  in  Figure  5.9.  Note 
that  the  hrst  order  entropy  rates  of  both  codes  are  the  same  since  P(l)  =  1  —  T’(O) 
is  equal  to  the  number  of  I’s  divided  by  the  total  number  of  samples,  and  they  are 
equal  regardless  of  the  ordering  mechanism.  Note  also  that  the  entropy  rate  of  the 
unordered  code  remains  reasonably  constant  (with  an  average  of  0.97),  implying  a 
stationary  and  independent  process  (this  is  true  because  of  the  random  selector  R). 
The  entropy  rate  of  the  ordered  code  is  lower  than  that  of  the  unordered  code  (in  fact, 
lower  than  0.75),  but  the  exact  value  is  not  certain.  The  reason  for  the  lower  entropy 
is  that  the  hrst  quarter  of  the  O  code  usually  has  zero  entropy  (see  Table  5.1)  and 
the  average  entropy  of  the  code  (the  average  of  the  entropies  in  the  H[S)  column) 
is  0.65.  The  probability  assignments  shown  in  Table  5.1  were  chosen  to  involve  large 
differences  between  sources  and  to  allow  for  a  variation  in  the  measured  entropy  rates 
of  the  unordered  and  ordered  codes. 

5.4.3  Empirical  analysis 

Next,  we  empirically  derive  the  entropy  of  U  and  O  to  arrive  at  a  measure  of  improve¬ 
ment.  We  can  hrst  analyze  the  system  of  Figure  5.8  and  generalize  H(U)  and  H[0) 
for  the  case  of  residual  pixel  ordering.  Consider  the  four  binary  sources  {Ni,  82^  Ns,  N4} 
with  associated  probability  distributions  of  {Psj  (s),  Ps^  (-s),  P53  (s),  Ps^  (s)  |  s  G  {0, 1}}. 
Using  (5.1)  we  can  calculate  the  entropy  of  each  source  separately  as  {i7(Ni),  P[[S2), 
//(Ns),  i7(N4)}.  It  can  be  easily  shown  that  these  are  true  entropies  (stationary  and 
independent  sources)  and  remain  constant  as  n  approaches  inhnity.  Since  the  selector 
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Sonrce 

P(0) 

P(l) 

H(S) 

1 

0 

0 

^2 

0.7 

0.3 

0.88 

^3 

0.5 

0.5 

1.00 

^4 

0.2 

0.8 

0.72 

S 

0.6 

0.4 

0.97 

Table  5.1:  Probability  assignments  nsed  to  generate  code  for  U  and  O. 


R  is  nniform,  Pnir)  =  the  probability  distribntion  of  the  bits  in  U  is 

Pu{u)  =  Er  [Psh(?^)] 

r=l 

=  jEftvl”)  (5.4) 

^  r=l 

resnlting  in  an  entropy  of 

H{U)  =  Pu{u)  X  \og2{Pu{u))  (5.5) 

u=0 

The  entropy  of  O  is  a  measnre  of  the  complexity  of  U  given  R.  In  this  way  the  extra 
complexity  of  R  is  removed  from  the  code  and  a  lower  entropy  resnlts.  By  dehnition 
[21] 


H(0)  =  H{U\R) 

=  =  r)Pi^(r) 

r  =  l 

=  ER[HiU\R)]  (5.6) 

The  complexity  of  a  sample  of  U,  given  that  the  mnltiplexed  sonrce  was  obtained 
from  if,  is  the  complexity  of  Sr-,  and  P[{U\R  =  r)  =  p[{Sr)-  The  entropy  of  the 
ordered  code  O  is  therefore  the  expected  valne  of  the  entropies  of  the  mnltiplexed 
binary  sonrces: 


H{0)  = 

r=l 

=  Er[H{Sr)]  (5.7) 

ft  is  important  to  note  that  the  entropy  indicates  the  maximnm  compression  possible. 
While  there  exists  no  general  algorithm  that  achieves  this,  there  are  algorithms  which 
can  get  arbitrarily  close  [21],  and  when  the  probabilities  are  powers  of  2,  the  Hnffman 


51 


Figure  5.9:  Entropy  rate  of  U  (o’s)  and  O  (*’s)  as  function  of  n. 

code  achieves  the  lowest  possible  bit  rate.  More  important  is  the  fact  [2f]  that  if  an 
algorithm  can  achieve  a  rate  of  H(U)  +  e  for  some  e  >  0,  then  it  can  also  achieve 
a  rate  oi  H{0)  -\-  e.  We  can  therefore  use  the  difference  between  these  entropies  to 
measure  the  expected  improvement.  This  difference  is  given  by 


H(U)-H(0)  =  H{U)-H{U\R} 

=  I{U]R) 

=  HY.  Pur{u,  r)  log 

tiew  reK 


PuR{u,r) 

^  Pu{u)PR{r) 


The  joint  probability  of  U  and  R  is  PuR(u^r)  =  Pu{u)  X  Tfl(r)  since  they  are  inde¬ 
pendent.  Pu(u)^  dehned  in  (5.4),  can  be  simplihed  to 


fi)  =  7  E  E  PsM  '<>82  4^^ 


Y.Ps.{>) 


For  the  system  shown  in  Figure  5.8  the  probabilities  {Ps^{s)^  ...^Ps^{s)}  are  specihed 
in  Table  5.1.  The  entropies  can  be  calculated  from  (5.5)  and  (5.7)  and  the  improve¬ 
ment  can  be  calculated  from  the  mutual  information  of  (5.9).  The  calculated  entropy 
of  U  is  P[{U)  =  0.971,  the  entropy  of  O  is  P[{0)  =  0.6508,  and  the  mutual  informa¬ 
tion  is  I{U]  R)  =  0.3202.  This  translates  into  an  improvement  of  almost  33%,  which 
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implies  that  on  the  average  one  can  expect  to  need  33%  less  storage  for  coding  O 
versns  coding  U .  In  other  words,  if  a  compressor  were  to  compress  U  to  fOO  bits, 
compressing  O  wonld  yield  67  bits  on  the  average. 

To  determine  the  improvement  resnlting  from  distance-based  ordering,  we  need 
to  consider  an  arbitrary  probability  distribntion  Tfl(r)  where  R  is  the  distance  valne 
indexed  by  the  distance  transform  map.  By  examining  an  image  and  connting  the 
distance  indexes  nsed  for  ordering  the  residnal  pixels  of  all  components,  we  can  esti¬ 
mate  Pu^r).  For  the  distribntion  of  binary  sonrces,  all  residnal  pixels  are  considered 
and  for  each  distance  R  =  r,  we  estimate  Ps,.{s).  To  nse  these  distribntions  in  onr 
analysis  we  relax  the  reqnirement  of  a  hxed  nnmber  of  binary  sonrces  and  a  nniform 
selector.  Given  PR(r),  Ps^{s)^  and  a  maximnm  observed  distance,  Rmax,  we  calcnlate 

Pu{u)  =  Er  [Psh(i<)] 

Rmax 

=  Y.  PrMPs,(’‘)  (5.10) 

r=l 

H{U)  =  -i2Pu{^)'^^g2M^)  (5-11) 

s=0 

H{0)  =  H(U\R) 

=  ERiHiSn)] 

Rmax  1 

=  -Y  PRi'-)YPsM)'«iAPsA‘>))  (6.12) 

r  =  l  s=0 

and 

I{U]R)  =  H{U)-H{U\R) 

1  Rmax  /  p  o')  \ 

.=0  r=l  V  Puys)  J 

Cnmnlative  event  connts  are  nsed  to  calcnlate  an  average  T’fl(r)  and  Psri-^)  over  a  set 
of  images  to  get  a  better  estimate  for  P[(U)^  P[(U\R)^  and  I(U]R).  Unfortnnately, 
since  we  have  only  a  limited  nnmber  of  residnal  pixel  observations,  especially  for 
large  distances,  this  method  may  resnlt  in  artihcially  inflated  probability  valnes  and 
may  yield  a  skewed  measnre  of  entropy.  Nevertheless,  in  practice  we  have  observed 
improvements  in  excess  of  f0%,  as  will  be  shown  in  Section  5.5. 

To  calcnlate  empirical  valnes  of  entropy  from  (5.f  0)-(5.f  3),  we  estimate  P{R)-,  the 
probability  of  the  pixel  sample  belonging  to  sonrce  if,  and  Pr{u),  the  probability  of 
sonrce  r  having  valne  u.  By  connting  how  many  distance  valnes  r  G  {1, ...,  if^aa-} 
were  observed,  we  can  estimate  P{R)  by  dividing  the  connts  by  the  snm  of  all  connts. 
Fignre  5.12a  shows  the  overall  distribntion  of  if,  calcnlated  cnmnlatively  over  120 
images.  Calcnlating  Pr{u)  reqnires  connting  I’s  and  O’s  separately  as  a  fnnction  of 
associated  distance  valne  r.  The  plot  of  =  1)  is  shown  in  Fignre  5.12b.  Using 
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our  distance  model  we  expect  decreasing  probability  as  r  increases.  This  is  observed 
primarily  for  distances  less  than  2000.  The  reason  for  the  unusually  high  values  of 
the  probabilities  observed  for  r  >  2000  is  the  low  number  of  observations  at  those 
distances. 

5.5  Results 

To  test  the  effectiveness  of  our  structural  coding  methods,  we  used  the  scanned  images 
in  the  database  and  performed  symbolic  compression  [54].  We  extracted  the  resid¬ 
ual  information  associated  with  non-NULL  prototypes  (NULL  prototypes  are  those 
classihed  as  graphics;  see  [50,  54]),  and  coded  the  residuals  based  on  the  distance, 
continuous,  and  packet  mode  orderings  (using  packet  sizes  of  20).  We  measured  their 
entropies  by  simulating  the  entropy  rate,  calculating  the  conditional  probability  den¬ 
sities,  and  using  (5.10)-(5.13).  Figure  5.10  shows  a  set  of  prototypes  obtained  from 
the  symbolic  compression,  as  well  as  the  corresponding  residuals.  The  residual  coding 
methods  discussed  in  this  paper  were  applied  to  the  residuals  shown  in  Figure  5.10d, 
which  constitute  the  majority  of  the  represented  information. 

We  hrst  present  the  results  of  the  entropy  rate  simulation  described  in  Section 
5.4.1  and  5.4.2  and  then  present  the  empirical  results  described  in  Section  5.4.3  for  the 
ordering  mechanisms  discussed  in  Sections  5.2,  5.3.1,  and  5.3.2.  Given  a  set  of  ordered 
residual  streams  we  calculate  the  entropy  of  (5.2)  for  X  =  {0, 1}”  and  increasing  n. 
Figure  5.11  shows  the  simulated  entropy  rates  up  to  order  n  =  10,  obtained  by 
extracting  the  residual  components  of  hve  document  images  (a  total  of  1  million 
residual  pixels),  ft  is  obvious  that  the  limit  of  the  entropy  rate  of  the  ordered  code 
cannot  be  determined  from  this  graph,  but  it  can  be  estimated  using  our  empirical 
tools.  Note  that  the  unordered  code  exhibits  a  downward  trend  which  implies  the 
existence  of  a  higher-order  process.  The  reason  for  this  is  that  the  “ordered”  results 
were  actually  based  on  row-ordered  code,  and  the  components  exhibited  correlation 
between  the  hrst  rows  of  residuals  belonging  to  the  same  prototype. 

Figure  5.13  shows  a  plot  of  entropies  for  120  document  images  and  Table  5.2 
summarizes  the  empirical  results.  To  estimate  the  overall  improvement,  H(U\R)  for 
each  ordering  mechanism  was  compared  to  H(U)  for  row  ordering,  yielding  the  results 
of  Table  5.3,  where  17  to  27  percent  improvement  was  observed  across  all  coding 
mechanisms  and  entropy  measurements.  Figure  5.14  shows  the  entropies  of  the  codes 
obtained  by  the  continuous  and  packet-mode  ordering  methods.  The  horizontal  lines 
denote  the  average  and  cumulative  values  of  the  entropy  for  the  unordered  codes. 

5.6  Summary 

We  hnd  that  the  most  important  contribution  to  the  residuals  is  due  to  symbol  degra¬ 
dation.  By  extending  existing  degradation  models  [48,  91,  72]  we  arrive  at  a  model 
that  takes  residual  pixels  less  likely  to  be  noise  as  informative  pixels.  By  ordering  the 
pixels  in  decreasing  order  of  distance,  we  are  able  to  classify  the  residual  pixels  into 
groups  that  have  relatively  stationary  distributions.  A  model  using  independent  bi¬ 
nary  sources  is  used  to  justify  the  distance  ordering.  We  have  shown  that  a  code  that 
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a  a  a  S  a  Pa  a 
The  ncore  CigaMa  architec¬ 
ture  and  the  Stanford  Paradigm  pro'ect 
both  use  a  hierarchy-of-buses  approach 
to  achieve  scalab'lity.  t  the  top  level, 
the  ncore  C'gaMax  is  composed  of 
several  clusters  on  a  global  bus.  ach 
cluster  consists  of  several  processor 
modules,  main  memory,  and  a  cluster 
cache.  The  cluster  cache  holds  a  copv  of 
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Figure  5.10:  Example  of  a)  a  set  of  prototype  maps,  b)  symbolic  representation,  c) 
components  that  were  classified  as  graphics  (“NULL”  prototypes),  and  d)  residuals 
of  components  that  were  assigned  to  non-NULL  prototypes. 


Cumulative 

H{U) 

H{0) 

/(U;  R) 

Distance 

.3407 

.2828 

.0579 

Structural 

.3226 

.27t4 

.05t2 

Packet 

.3224 

.27t2 

.05t2 

Average 

H{U) 

H{0) 

/(U;  R) 

Distance 

.3626 

.2980 

.0646 

Structural 

.3244 

.2636 

.0608 

Packet 

.3246 

.2637 

.0608 

Table  5.2:  Summary  of  entropy  values,  in  cumulative  and  average  versions,  for  all 
coding  methods. 
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Figure  5.11:  Entropy  rates  for  unordered  and  distance-ordered  coding. 


(a) 


Distance  (pixels) 

(b) 


Figure  5.12:  Probability  distribution  function  of  a)  selector  used  for  ordering  the  code 
and  b)  the  binary  distribution  for  each  selection  as  calculated  cumulatively  over  120 
document  images. 
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Figure  5.13:  Comparison  of  codes  in  terms  of  information:  a)  entropy  of  the  unordered 
code  [/,  b)  entropy  of  the  ordered  code  O,  c)  improvement  in  entropy  (the  mutual 
information  between  the  unordered  and  ordered  codes). 


Percent  improvement 


Cumulative 

Averaged 

Distance 

17% 

18% 

Structural 

20% 

27% 

Packet 

20% 

27% 

Table  5.3:  Improvements  in  coding  methodologies  with  respect  to  unordered  code. 


Entropy  of  continuous  ordered  code  w.r.t.  unordered  code 


Entropy  of  packet-mode  ordered  code  w.r.t.  unordered  code 
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(a) 
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Figure  5.t4:  Fntropy  of  a)  continuous  and  b)  packet-mode  structural  coding,  along 
with  their  cumulative  and  average  values  and  the  cumulative  and  average  values  of 
entropy  for  unpredicted  and  unordered  code. 
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does  not  use  the  probability  assignments  has  higher  entropy  and  less  compressibility. 
Furthermore,  the  group  distributions  need  not  be  known;  the  fact  that  statistically 
similar  pixels  are  grouped  together  is  sufficient  to  provide  higher  compression.  An 
important  characteristic  of  this  method  of  residual  coding  is  that  the  distance  order¬ 
ing  depends  only  on  the  prototypes  and  is  therefore  available  to  both  the  transmitter 
and  receiver. 

The  notion  of  structure  has  been  considered  as  a  method  of  assessing  the  con¬ 
tributions  of  pixels  to  structural  components.  Distance-ordered  coding  provides  an 
important  basis  for  predicting  this  structure.  A  predictive  method  is  suggested  that 
makes  use  of  the  fact  that  structural  components  are  connected.  Any  pixel  far  away 
from  an  edge  makes  a  high  contribution  to  readability;  but  connected  sets  of  pixels 
form  structural  components.  Packet-mode  prediction,  which  can  be  regarded  as  a 
tradeoff  between  distance  ordering  and  predictive  ordering,  provides  a  more  flexible 
approach  than  pixel-by-pixel  prediction. 

As  a  generalization  of  our  method,  if  a  probabilistic  model  of  the  residual  infor¬ 
mation  can  be  determined  for  a  lossy  compression  algorithm,  it  should  be  possible  to 
beneht  from  our  coding  framework.  A  lossy  algorithm  provides  a  base  image  which 
can  then  be  used  to  estimate  the  residual  order  based  on  the  probability  model.  This 
can  apply  to  images  beyond  the  document  image  domain. 
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Chapter  6 


SCoDI:  Symbolic  Compression  of  Document  Imagery 


6.1  Introduction 

Symbolic  Compression  of  Document  Images  (SCoDI)  is  an  implementation  of  the 
techniques  described  in  the  previous  chapters  to  create  a  framework  for  document 
image  compression  using  the  symbolic  approach.  We  have  previously  discussed  the 
general  approach  and  its  data  representation  requirements  along  with  detailed  discus¬ 
sions  of  the  segmentation,  clustering,  and  residual  coding  tasks.  The  interrelationship 
between  these  components  needs  to  be  mentioned  in  order  to  arrive  at  a  comprehen¬ 
sive  description. 

During  segmentation  we  rely  on  connected  components  to  provide  a  set  of  patterns 
that  resemble  an  underlying  language  and  that  can  be  used  in  a  clustering  algorithm  to 
arrive  at  prototypes  which  resemble  elements  of  the  language.  Connected  components 
are  a  naive  approach  to  achieving  this  result.  In  Chapter  4  we  showed  that  obtaining 
good  prototypes  does  depend  on  the  segmentation  and  eventually  affects  compression 
and  processing  performance. 

In  our  system  we  use  a  pattern  matching  and  classihcation  algorithm  which  basi¬ 
cally  tests  the  amount  of  match  between  a  presented  pattern  and  one  that  is  stored 
as  a  cluster  prototype.  If  a  match  exists  then  we  assign  the  pattern  to  the  cluster; 
otherwise  we  create  a  new  cluster  and  let  the  prototype  be  equal  to  the  presented 
pattern.  A  large  number  of  matching  functions  are  considered  and  their  effect  on  the 
performance  of  our  system  is  considered. 

Finally,  the  task  of  residual  coding  is  considered.  Since  the  majority  of  the  com¬ 
pressed  information  is  in  the  residual  code,  it  is  necessary  to  address  concerns  about 
its  coding.  We  propose  methods  that  organize  the  residuals  in  a  way  that  is  compact 
(compressible),  hierarchically  organized  (compressible  and  processable),  and  associ¬ 
ated  with  their  symbolic  counterparts  (processable). 

In  implementing  a  system  that  can  take  advantage  of  our  representation,  we  must 
create  some  building  blocks.  The  categorization  of  these  blocks  provides  a  detailed 
road  map  for  future  research  and  enhancements  to  the  system,  since  they  form  the  fun¬ 
damental  architecture  for  our  approach.  The  overall  structure  is  shown  in  Figure  6.t. 
While  the  baseline  idea  is  a  derivative  of  pattern  matching  and  substitution  (PMS) 
algorithms  [tO,  64],  this  representation  provides  a  general  framework  for  achieving 
compression  while  allowing  for  compressed-domain  processing. 

In  this  chapter  we  will  discuss  the  implementation  issues  we  encountered.  Specif¬ 
ically,  we  discuss  segmentation  in  Section  6.2,  clustering  in  Section  6.3,  component 
representation  in  Section  6.4,  and  residual  coding  in  Section  6.5.  We  address  some 
compressed-domain  processing  requirements  in  Section  6.6,  and  we  summarize  our 
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Figure  6.1:  Overall  design  structure  for  symbolic  compression  of  images. 


implementation  in  Section  6.7. 

6.2  Segmentation 

The  first  step  in  our  method,  as  described  in  Section  2.f,  is  to  extract  patterns  from 
an  image.  Segmentation  is  a  generic  term  and  is  highly  dependent  on  image  type  and 
characteristics.  For  bi-level  document  images,  a  simple  and  effective  segmentation 
method  is  extraction  of  connected  components.  The  bounding  boxes  for  connected 
components  then  should  contain  the  patterns  that  repeat  in  the  document  image.  In 
our  approach  we  have  implemented  a  connected  component  analysis  routine  which 
scans  a  pixel  row  of  image  and  tags  runs  of  foreground  pixels.  The  same  process 
occurs  on  the  succeeding  scan-line  with  the  exception  that  tags  which  are  equal  across 
scanlines  are  combined.  Updating  corner  locations  provides  the  bounding  boxes. 

Other  segmentation  methods  can  be  pursued  to  extract  repetitive  patterns.  An 
example  of  such  a  technique  is  discussed  in  Chapter  3  where  given  a  set  of  prototypes 
we  re-evaluate  the  components  to  arrive  at  a  better  segmentation.  This  method  is 
especially  useful  for  degraded  images  where  components  have  started  to  touch  each 
other  and  the  representative  prototypes  are  images  of  touching  characters.  For  an 
extension  to  grayscale  images,  it  is  desirable  to  use  the  grayscale  information  to 
achieve  a  better  segmentation.  In  effect,  a  good  segmentation  is  one  that  provides 
fundamental  patterns  within  the  image  that  are  redundant.  In  terms  of  compressed- 
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domain  processing,  an  effective  segmentation  is  one  that  extracts  usable  information 
to  be  accessed  later.  In  the  case  of  document  images,  the  connected-component 
segmentation  allowed  for  extraction  of  redundant  patterns  and  usable  information. 

6.3  Clustering 

Once  a  suitable  segmentation  is  obtained,  a  clustering  algorithm  needs  to  be  used 
to  determine  similarities  between  the  extracted  patterns.  In  general,  a  clustering 
algorithm  needs  to  operate  on  the  segmentation  output  and  to  capitalize  on  pattern 
redundancy  while  behaving  robustly  when  inconsistencies  occur.  For  example,  a 
situation  that  occurs  in  document  image  segmentation  and  clustering  is  that  the 
extracted  patterns  were  of  different  sizes,  so  our  clustering  algorithm  was  required  to 
provide  a  degree  of  robustness  to  the  input  component  sizes.  A  number  of  clustering 
algorithms  are  considered  in  Chapter  3,  and  a  suitable  one  is  used  for  our  compression. 

In  our  implementation  we  input  the  bitmap  image,  the  bounding  boxes  of  the 
connected  components,  and  a  threshold  value  to  a  routine.  The  routine  scans  all  the 
components  and  matches  them  against  the  prototype  images.  Each  match  is  com¬ 
pared  to  a  threshold,  normalized  by  the  size  of  the  the  component  (i.e.  if  the  number 
of  matched  pixels  is  above  a  percentage  of  the  component  size,  the  component  is  clas- 
sihed  as  belonging  to  that  cluster).  If  a  close  enough  match  occurs  the  component 
is  classihed  in  the  corresponding  cluster,  otherwise  it  is  used  to  create  a  new  cluster. 
After  a  number  of  components  are  added  to  a  cluster,  the  cluster  prototype  is  recalcu¬ 
lated.  In  our  system  we  recalculate  prototypes  every  time  hve  components  are  added 
and  do  not  recalculate  after  the  tenth  calculation.  To  avoid  generating  a  large  number 
of  prototypes,  if  the  ratio  of  the  number  of  prototypes  generated  to  the  number  of 
components  visited  exceeds  by  hfty  percent  the  ratio  of  the  maximum  number  of  pro¬ 
totypes  to  the  total  number  of  components,  the  threshold  value  is  increased  linearly 
by  this  amount  and  the  clustering  is  repeated.  By  starting  with  a  small  threshold  we 
can  end  the  clustering  process  with  an  adequate  number  of  clusters. 

The  relationship  between  segmentation  and  clustering  is  more  complicated  than 
we  have  indicated  here.  In  our  baseline  approach,  clustering  is  dependent  on  seg¬ 
mentation,  but  as  hinted  in  the  previous  section,  segmentation  can  be  improved  if 
a  clustering  result  is  available.  In  Chapter  4  we  discussed  a  joint  segmentation¬ 
clustering  approach  which  not  only  improves  the  compression  results  but  promises  to 
improve  processing  performance.  In  essence  there  should  exist  a  higher-level  process 
overseeing  the  segmentation  and  clustering  tasks,  making  sure  that  they  access  each 
other’s  intermediate  results,  as  shown  in  Figure  6.2. 

6.4  Component  representation 

Once  we  have  determined  the  clusters,  it  is  relatively  easy  to  dehne  a  representative 
prototype  for  each  cluster.  In  our  method  we  averaged  the  members  of  each  cluster 
to  calculate  its  prototype  image.  Hobby  and  Baird  call  this  “naive  averaging”  [34]. 
Zhang  and  Danskin  [116]  use  an  alternative  method  based  on  scaling  and  hltering. 
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Joint  segmentation 
and  clustering 


Figure  6.2:  Joint  segmentation  and  clustering  relationship. 


Our  “naive  averaging”  has  produced  good  results  mostly  due  to  the  highly  structured 
character  layout  of  document  images. 

Given  the  prototypes,  we  need  to  represent  their  locations  within  the  document 
image.  To  preserve  access  to  the  prototypes  (i.e.,  to  the  components  that  are  rep¬ 
resented  by  them),  we  give  the  address  of  each  component,  as  described  in  Section 
2.2.f.  This  representation  is  somewhat  redundant.  A  highly  structured  document 
image  has  components  which  he  on  lines,  ft  is  therefore  much  simpler  to  give  relative 
addresses  of  components  rather  than  absolute  addresses.  Work  done  by  Witten  et  ah 
[112]  discusses  the  optimization  of  component  indexing  in  this  context;  however,  due 
to  the  small  storage  space  needed  to  record  the  addresses,  this  was  not  pursued,  ft  is 
also  possible  to  pursue  a  rate-distortion  analysis  with  regard  to  component  indexing, 
that  is  to  say,  provide  addresses  which  are  not  exact  and  allow  for  a  certain  level  of 
distortion,  while  achieving  a  higher  compression  ratio. 

6.5  Residual  coding 

ft  is  possible  to  achieve  very  high  compression  by  substituting  for  each  segmented 
patterns  its  associated  prototype.  In  document  image  compression,  this  is  effective 
to  a  great  extent;  however,  exact  rendition  (at  least  for  reading  purposes)  cannot  be 
guaranteed  since  there  always  exist  ambiguous  characters  which  are  assigned  to  the 
same  cluster.  Readability  is  not  effectively  achieved  if  too  many  ‘e’  characters  are 
replaced  by  ‘o’  characters,  even  though  we  can  effectively  read  the  ‘o’  characters.  To 
provide  graceful  degradation  in  reading  quality,  we  make  use  of  a  residual  code. 

In  our  experiments  we  observed  that  the  largest  amount  of  information  is  contained 
in  the  residual  representation.  To  achieve  a  more  efficient  representation  we  pursued 
a  model-driven  approach  to  achieve  compression  and  to  provide  graceful  degradation 
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when  residues  are  partially  represented.  This  is  logical  in  that  given  a  segmentation 
and  clustering,  each  prototype  represents  a  class  of  observations  which  is  dehned  by 
the  variability  in  the  prototype’s  cluster.  This  variability  is  hnite  and  has  a  specihc 
shape,  so  for  a  class  of  images  (to  be  compressed)  it  is  possible  to  model  the  variations 
effectively  to  provide  a  hierarchical  representation  for  use  in  graceful  degradation  of 
tasks  like  lossy  representation  and  progressive  transmission.  In  our  implementation 
we  take  as  input  the  original  bitmapped  image,  the  prototypes,  and  the  component 
memberships  and  create  a  stream  in  which  residuals  of  components  are  appended 
and  indexed  according  to  their  order  of  appearance.  First  the  prototypes  are  used  to 
generate  their  distance  transforms,  and  these  distance  transforms  are  used  to  order 
the  residuals  for  distance-based  coding.  For  structural  coding,  we  used  the  residuals, 
prototypes,  and  component  memberships  as  inputs  to  Matlab  routines  and  measured 
their  progressive  entropy  by  coding  the  residuals  based  on  continuous  and  packet¬ 
mode  coding.  Special  implementation  of  residual  coding  is  done  to  allow  for  packing 
of  bits  into  bytes  while  keeping  track  of  where  the  residual  information  for  components 
starts  and  hnishes,  without  wasting  bits  and  keeping  the  information  accessible  after 
performing  the  Huffman  code.  This  is  done  by  keeping  a  byte  and  a  bit  counter  for 
each  residual  set,  Huffman  coding  the  set,  and  recalculating  the  byte  and  bit  counter 
after  coding. 

While  the  complexity  of  the  residuals  is  dependent  on  the  quality  of  segmentation, 
clustering,  and  component  representation,  it  is  easy  to  see  that  their  relationships 
cannot  be  easily  modeled  to  ht  a  the  framework  shown  in  Figure  6.3.  The  only 
plausible  outlook  for  achieving  such  a  “Global  Manager”  is  to  model  the  effects  of 
individual  components  so  that  one  component  need  not  wait  for  the  effects  of  others 
to  propagate.  For  example,  a  segmentation  that  is  based  on  existing  clusters  should 
also  consider  the  number  and  complexity  of  the  residuals  and  the  component  layout  in 
order  to  make  an  optimal  determination  of  the  correct  segmentation.  If  a  global  model 
is  derived  that  can  be  trained,  a  global  manager  can  use  it  for  optimal  interaction 
between  components. 

6.6  Compressed-domain  processing 

Our  eventual  goal  is  to  perform  a  set  of  compressed-domain  processing  tasks.  In  our 
implementation  we  created  a  set  of  routines  with  generic  functions.  One  function  was 
to  compress  and  decompress  simple  streams.  A  second  routine  compressed  a  stream 
while  modifying  its  indexes;  this  usually  involved  simple  compression  of  the  indexes. 
A  third  routine  decompressed  only  a  section  of  the  stream.  Using  combinations  of 
these  routines  we  are  able  to  feed  appropriate  portions  of  the  information  to  specihc 
document  image  processing  tasks.  Since  we  are  able  to  divide  the  information  into 
logical  partitions,  we  have  access  to  individual  information  sections,  as  required  by 
the  processing  elements  and  by  the  need  to  perform  extra  decompression.  For  large 
information  sets,  we  are  able  to  index  them  so  that  unnecessary  decompression  and 
processing  is  minimized. 

Since  most  of  the  information  in  document  images  lies  in  the  components,  we  are 
able  to  achieve  compressed-domain  processing.  This  is  not  readily  obvious  for  other 
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Managed  Processes 


Figure  6.3:  Management  of  compression  modules  to  achieve  a  global  goal. 

classes  of  images.  However,  it  is  possible  to  identify  a  class  of  images  that  are  to 
be  used  for  a  specibc  purpose,  and  model  their  information  accordingly  so  that  the 
dissemination  of  information  is  done  efficiently.  Such  situations  abound  in  today’s 
networked  databases,  and  in  server-client  information  exchange  systems. 

6.7  Summary 

The  implementation  of  our  compression  system  has  been  discussed  in  this  chapter. 
Detailed  discussions  of  the  implementation  of  segmentation,  clustering,  residual  cod¬ 
ing,  and  component  representation  are  given  to  support  our  earlier  claims  about 
compression  and  processing.  At  every  step  the  implementation  is  discussed  together 
with  supporting  issues  and  concerns,  as  well  as  suggestions  for  improvement. 

We  discuss  and  explain  our  algorithmic  approach  to  connected  component  analysis 
and  bounding  box  calculation  for  segmentation,  ft  is  important  to  note  that  this 
analysis  is  ideal  for  binary  images  that  have  small  amounts  of  degradation.  Earlier  we 
discussed  alternate  methods  of  segmentation  under  heavy  degradation  and  examined 
how  these  methods  ht  into  our  compression  scheme.  We  observed  that  in  presence 
of  degradation  an  operations  manager  can  provide  control  over  better  segmentation 
and  clustering. 

Residual  coding  and  its  implement ational  concerns  is  also  discussed.  We  suggest 
that  for  an  optimized  system  interaction  between  subcomponents  of  the  compression 
system  needs  to  exist  in  order  to  allow  them  to  complement  each  other’s  weaknesses 
and  strengths.  Such  an  elaborate  system  is  only  possible  with  current  computa¬ 
tional  technology.  Finally,  we  discuss  the  general  implementation  issues  of  performing 
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compressed- domain  processing  tasks. 

Once  our  system  is  implemented,  we  can  consider  testing  its  performance  at 
compression  and  processing.  These  validations  include  rate-distortion  analysis  and 
compressed-domain  processing  applications. 
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Chapter  7 


Rate-distortion  analysis 


7.1  Introduction 

Two  common  problems  in  compression  research  are  achieving  lossy  compression  and 
providing  a  representation  for  progressive  transmission.  The  performance  of  any  solu¬ 
tion  is  characterized  by  a  rate-distortion  trade-off  in  which  we  attempt  to  optimize  the 
correct  recognition  of  image  contents.  Using  symbolic  compression  in  document  im¬ 
ages  we  have  shown  that  compression  can  be  achieved  while  allowing  for  compressed- 
domain  processing  [54].  While  the  ability  to  perform  progressive  transmission  and 
scalable  lossy  compression  has  also  been  demonstrated,  a  rigorous  rate-distortion 
trade-off  analysis  is  necessary.  Since  the  symbolic  representation  contains  distinct 
data  types  with  different  characteristics,  rate-distortion  analysis  should  be  performed 
on  all  parts  and  be  integrated  coherently  to  promote  a  comprehensive  measure.  We 
perform  the  analysis  here  only  on  the  residual  section  of  the  data,  where  about  80% 
of  the  data  is  contained. 

The  problem  that  must  be  addressed  is  how  to  dehne  an  appropriate  distortion 
function  for  document  images.  The  traditional  signal-to-noise  ratio  and  its  derivatives 
are  not  desirable.  For  a  constant  power  level  there  exist  different  representations  that 
render  varying  amounts  of  the  image.  Pixels  carry  varying  amounts  of  information 
which  affects  the  correct  recognition  of  the  image  components.  The  distortion  function 
should  reflect  the  correct  recognition  of  the  lossy  representations.  Using  an  OCR 
engine  and  OCR  evaluation  software,  we  were  able  to  link  the  correct  recognition  of 
an  image  with  an  associated  entropy.  This  was  also  compared  to  distance-weighted 
distortion. 

This  chapter  is  organized  as  follows.  In  Section  7.2  we  provide  a  brief  motivation 
for  our  approach.  In  Section  7.3  we  discuss  in  detail  our  procedure  and  results.  In 
Section  7.4  we  provide  similar  results  based  on  pixel-based  matching  (distance-based 
matching)  instead  of  the  OCR-based  approach.  We  conclude  in  Section  7.5  with  some 
hnal  remarks. 

7.2  Motivation 

In  the  previous  chapters  we  have  argued  that  since  characters  degrade  around  the 
edges  [48,  72,  91]  the  pixels  far  from  edges  contribute  more  toward  correct  recognition 
than  pixels  close  to  edges.  This  was  also  supported  by  the  fact  that  most  degraded 
images  are  still  recognizable.  It  is  however  not  trivial  to  extend  this  concept  to  address 
rate-distortion  concerns.  This  is  because  the  concept  of  distortion  is  eventually  related 
to  correct  recognition  which  is  hard  to  measure  from  a  presented  image.  Here  we  use 
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Figure  7.1:  Example  of  a)  an  observation,  b)  a  residual  map,  c)  the  distance  map  of 
the  prototype,  d)  distance-ordered  residual  code,  and  e)  row-ordered  residual  code. 
The  codes  in  (d)  and  (e)  are  ordered  from  left  to  right. 

|[|||||||||||  (d) 

IJil  m  (e) 

(a)  (b)  (c)  - ^ 


Figure  7.2:  Example  of  a)  an  observation,  b)  a  residual  map,  c)  the  distance  map  of 
the  prototype,  d)  distance-ordered  residual  code,  and  e)  row-ordered  residual  code. 
The  codes  in  (d)  and  (e)  are  ordered  from  left  to  right. 


OCR  to  provide  a  measure  of  correct  recognition  of  components  from  a  presented 
image. 

7.2.1  Hierarchy 

A  typical  observation,  residual  map,  distance  transform,  distance-order  code,  and  row- 
order  code  are  shown  in  Figures  7.t  and  7.2.  It  is  easy  to  see  that  the  residual  pixels  in 
Figure  7.1b  do  not  contribute  signihcantly  to  recognition,  whereas  the  residual  pixels 
in  Figure  7.2b  do.  For  the  distance  order,  the  residuals  of  Figure  7.1b  are  coded 
towards  the  end  of  the  stream,  as  shown  in  Figure  7. Id,  while  for  the  row-ordered 
stream  they  are  spread  over  the  entire  stream  (Figure  7.1e).  For  the  distance-ordered 
residual  of  Figure  7. 2d,  the  pixels  which  distinguish  the  ‘C’  from  the  ‘G’  are  coded 
hrst,  in  contrast  to  the  row-ordered  residual  code  of  Figure  7.2e  which  contains  these 
pixels  in  the  middle  of  the  code.  Examples  of  residuals  are  shown  in  Figure  7.3. 
Letters  ‘E’  and  ‘i’  were  classihed  as  graphics,  ‘G’  was  classihed  in  a  ‘C’  prototype, 
and  the  rest  of  the  components  have  only  silhouette  residuals. 

Figure  7.4  shows  an  example  of  extreme  lossy/lossless  representations  with  Figure 
7.4a  requiring  a  lower  entropy  than  Figure  7.4b.  A  coding  method  interpolates  be¬ 
tween  these  extremes  to  provide  an  intermediate  representation  of  a  desired  entropy 
and  an  associated  distortion.  The  point  of  a  hierarchy  is  that  given  a  representation 


67 


Figure  7.3:  Examples  of  various  types  of  residuals. 
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Figure  7.4:  Section  of  document  image  rendered  with  (a)  almost  no  residuals  and  (b) 
almost  all  residuals. 

with  a  rate-distortion  pair  of  (r,  d)  it  is  desired  to  achieve  (r-|-^r)  entropy  and  (d  —  Sd) 
distortion  by  transmitting  only  information  of  Sr  entropy.  A  simple  hierarchy  is  cre¬ 
ated  by  traversing  from  left  to  right  in  the  distance-ordered  and  row-ordered  codes  of 
Figure  7.f .  For  an  intermediate  representation  we  use  a  fractional  part  of  the  residual 
code.  The  hierarchy  can  then  be  tested  by  comparing  the  rate-distortion  curves  of 
forward  and  reverse  ordering;  forward  ordering  should  clearly  perform  better  than 
reverse  ordering. 

7.2.2  Analysis 

Document  images  are  intended  for  reading  and  we  have  therefore  chosen  the  distortion 
to  be  a  function  of  recognition  as  dehned  by  OCR  accuracy.  We  hrst  need  to  determine 
the  baseline  text  by  considering  either  ground  truth  data  or  the  OCR  results  on 
the  lossless  image.  For  each  intermediate  representation  we  perform  OCR.  We  then 
evaluate  the  performance  of  the  OCR  output  with  respect  to  the  baseline  text  by 
string  matching  [17,  20,  56]. 

Figure  7.5  shows  examples  of  the  words  “GigaMax”  and  “copy”  in  three  different 
instances  yielding  almost  equal  entropies,  ft  is  clear  that  distance-ordered  code  (Fig¬ 
ures  7.5a  and  b)  is  likely  to  produce  fewer  OCR  errors  than  reverse  distance-ordered 
code.  Row-ordered  code  will  have  errors,  but  since  reconstruction  of  the  letter  ‘G’ 
has  started,  it  may  perform  better  than  reverse  distance-ordering.  Errors  are  not 
widespread  even  in  the  most  lossy  representation.  This  is  ideal  since  in  Chapter  8  we 
will  show  that  many  document  image  processing  tasks  can  be  done  without  the  use 
of  residuals  and  that  the  OCR  engine  will  be  working  at  high  accuracies  with  reliable 
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Figure  7.5:  Example  of  portions  of  an  image  (Figure  7.4)  with  the  use  of  roughly 
equal  entropy  in  lossy  rendition  using  (a,b)  distance  ordering,  (c,d)  row  ordering,  and 
(e,f)  reverse  distance  ordering. 


output.  OCR  evaluation  is  done  by  string  matching  by  use  of  available  software  which 
outputs  the  numbers  of  matched,  deleted,  inserted,  and  changed  symbols.  We  dehne 
the  inverse  of  distortion  (performance)  to  be  the  number  of  deleted,  inserted,  and 
changed  symbols  minus  the  number  of  matched  symbols. 

7.3  Procedure  and  results 

In  determining  the  rate-distortion  relation,  two  procedures  were  used.  The  hrst  was 
to  render  an  intermediate  representation  with  an  associated  entropy  and  to  perform 
OCR.  The  second  was  to  compare  the  OCR  outputs  with  the  ground  truth  data  and 
determine  the  OCR  distortion.  In  creating  the  intermediate  representations,  we  used 
the  University  of  Washington  [32]  images  and  represented  them  using  symbolic  com¬ 
pression.  Given  fractions  of  the  residual  codes  were  zeroed  (for  each  component), 
and  the  distribution  of  the  residuals  (number  of  residuals  and  number  of  zeros)  was 
computed.  Only  the  residuals  that  corresponded  to  components  assigned  to  a  pro¬ 
totype  were  used  in  the  intermediate  representation.  This  was  necessary  to  avoid 
contamination  by  components  that  were  not  clustered  appropriately.  The  temporary 
representation  was  then  fed  to  an  OCR  engine  (ScanWorx  by  Xerox,  API  software) 
and  the  text  output  was  saved  for  later  comparisons.  The  OCR  outputs  were  evah 
uated  using  software  available  in  the  University  of  Washington  database  [32].  The 
evaluation  was  done  with  respect  to  the  ground  truth  data  with  the  use  of  zones  to 
avoid  contamination  by  graphs  and  images.  For  each  representation  set  we  determined 
the  achieved  entropy,  and  using  the  evaluation  result  we  determined  a  rate-distortion 
relation. 

Since  a  linear  increase  in  the  fraction  of  residual  code  does  not  translate  into  a 
linear  increase  in  entropy  we  had  to  specify  fractional  amounts  that  would  provide  a 
better  distribution  of  points  on  the  entropy  axis.  This  was  done  for  distance  ordering 
only.  The  plot  of  distortion  with  respect  to  entropy  is  shown  in  Figure  7.6.  This 
result  was  obtained  on  72  images  containing  214,292  ground-truthed  characters.  The 


69 


*  -  Dist,  X  -  Rev.  Dist,  +  -  Row,  o  -  Rev.  Row 


Figure  7.6:  Rate-distortion  plot.  Symbol  denotes  distance  ordering,  ‘x’  denotes 
row  ordering,  and  denotes  reverse  distance  ordering. 


performance  of  distance  ordering  is  clearly  superior  to  that  of  the  other  codes.  An 
interesting  observation  is  that  distance  ordering  clearly  performs  better  than  reverse 
distance  ordering.  In  the  case  of  row  and  column  ordering,  that  their  reversals  perform 
neither  better  nor  worse.  This  suggests  that  row  or  column  ordering  does  not  provide 
an  adequate  representational  hierarchy.  The  fact  that  reverse  distance  ordering  does 
much  worse  than  forward  distance  ordering  suggests  that  there  exists  an  inherent 
representational  hierarchy.  Another  observation  is  that  in  Chapter  5  we  were  able  to 
achieve  an  18%  to  27%  improvement  in  entropy  over  distance  ordering  by  performing 
structural  coding.  The  expected  amount  of  improvement  is  shown  in  Figure  7.6  in 
the  form  of  dashed  lines.  The  dashed  line  only  reflects  the  improvement  in  entropy, 
but  structural  coding  should  also  improve  distortion  (OCR  on  distance-based). 

Figure  7.6  shows  the  relationship  between  distortion  level  and  achieved  entropy 
rate  as  partial  residual  coding  progresses  using  different  coding  schemes.  This  result 
indicates  the  range  of  coding  efficiency  achieved  by  the  coding  methods  considered 
here.  While  the  rate-distortion  function  cannot  be  derived  from  these  results,  it  can 
be  roughly  estimated.  The  rate-distortion  function  gives  the  minimum  distortion  at 
a  given  entropy  rate,  ft  is  unrealistic  to  consider  all  combinations  of  residuals  that 
have  the  same  entropy  and  determine  their  distortions,  due  to  the  large  amount  of 
processing  needed  for  residual  representation  and  for  OCR.  However,  each  bit  order 
can  be  used  to  model  the  best  possible  representation  given  an  entropy  rate.  All 
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possible  residual  codes  that  achieve  a  given  entropy  can  be  replaced  by  a  single  code, 
ordered  in  a  hierarchy,  that  has  that  entropy  rate.  The  possible  improvement  is  shown 
in  dashed  lines  in  Figure  7.6.  The  functions  shown  in  Figure  7.6  need  not  be  strictly 
convex  for  two  reasons.  The  hrst  reason  is  that  they  are  based  on  empirical  data. 
The  second  reason  is  that  the  distortion  is  averaged  over  a  large  number  of  images 
that  vary  in  quality;  while  convexity  may  be  observed  for  individual  images,  it  may 
be  lost  over  a  set  of  images. 

7.4  Distance-based  rate-distortion 

Distance-based  ordering  has  natural  extensions  to  lossy  compression  and  progressive 
transmission  applications.  In  the  case  of  lossy  compression,  the  residual  stream  can  be 
terminated  at  a  predehned  point.  Recall  that  in  the  previous  sections,  we  showed  that 
for  lossy  compression,  the  distance  order  residual  is  structurally  more  informative  (has 
a  lower  distortion)  and  is  more  compact  (has  lower  entropy  rate).  These  properties 
are  advantageous  in  progressive  transmission  as  well  as  lossy  compression.  For  each 
symbol  and  given  code  length  we  would  like  to  retain  maximal  quality  (i.e.  lower 
distortion). 

Figure  7.7  shows  a  typical  rate-distortion  function  achieved  for  a  sample  image. 
This  curve  shows  a  signihcant  change  in  distortion  per  unit  entropy  rate  for  entropy 
rates  up  to  O.OOfS  bits  per  residual  pixel.  An  example  of  a  lossy  compressed  compo¬ 
nent  was  shown  in  Figure  5.1b,  and  its  associated  residual  map  in  Figure  5.1c. 

The  rate-distortion  calculation  using  distance-based  matching  was  done  for  both 
the  unordered  and  ordered  (continuous  and  packet-mode)  codes.  The  distortion  mea¬ 
sure  used  was  the  same  as  the  distance-based  mismatch  measure  dehned  in  [54],  where 
the  total  distortion  is  measured  as  the  sum  of  distances  for  the  pixels  that  were  not 
coded.  The  entropy  rate,  r,  was  calculated  as 

H\(E)  X  Number  of  residual  pixels  , 

r  =  ^ ^ -  (7.1) 

image  size 

in  units  of  bits  per  pixel.  For  the  case  of  unordered  residual  coding,  the  pixels  are 
coded  in  row  order.  Figure  7.8  shows  the  rate-distortion  functions  for  unordered, 
distance-ordered,  continuous  and  packet-mode  structural  coding.  The  diagonal  line 
represents  the  unordered  code,  for  which  the  tradeoff  between  distortion  and  transmis¬ 
sion  rate  is  almost  a  constant.  For  distance  ordering,  we  coded  bits  that  contributed 
to  the  highest  distortion  hrst  and  calculated  their  rate  as  specihed  above.  Continuous 
coding  and  packet-mode  structural  coding  were  also  used  to  calculate  distortion  and 
achievable  entropy  rate.  Figure  7.8  shows  the  performances  of  these  coding  mech¬ 
anisms.  The  structural  methods  give  the  best  result,  slightly  better  than  that  of 
distance-ordered  coding.  Note  that  the  packet-mode  and  continuous  methods  are  not 
distinguishable  in  the  plot  shown  in  Figure  7.8. 

This  analysis  is  incomplete  in  that  the  measure  of  distortion  is  not  comprehensive 
enough.  The  primary  conclusion  from  Figure  7.8  is  that  distance  ordering  contributes 
much  more  toward  reduction  of  distortion  than  row  ordering.  This  is  obvious,  because 
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Figure  7.8:  Rate-distortion  plots  of  unordered,  distance-based,  continuous,  and 
packet-mode  structural  coding. 


the  distortion  is  based  on  the  distance  transform,  ft  remains  to  establish  that  the 
distortion  measure  used  in  Figure  7.8  measures  something  useful  such  as  the  reading 
quality  of  the  image  which  in  turn  affects  the  correct  recognition  of  individual  symbols. 
In  the  previous  section  OCR  errors  were  shown  to  be  less  for  distance  ordering  than 
row  ordering  for  the  same  entropy  rate.  It  follows  that  the  distortion  measure  used 
for  Figure  7.8  may  calculate  a  measure  close  to  reading  quality.  An  automatic  choice 
of  the  best  distortion-rate  pair  is  not  obvious  in  OCR-based  distortion,  but  is  quite 
easy  in  distance-based  distortion. 

7.5  Summary 

In  compression,  it  is  often  desirable  to  perform  lossy  compression  and  progressive 
transmission.  Traditionally,  algorithms  which  provide  these  capabilities  use  a  rate- 
distortion  plot  to  determine  the  exact  compression  factor  achieved  by  allowing  a  hxed 
amount  of  distortion  measured  by  the  signal-to- noise  power  ratio.  This  is  easy  to  do, 
since  the  signal  is  modeled  by  equiprobable  samples  and  the  amount  of  deviation  from 
the  model  is  taken  to  be  the  distortion  by  taking  a  simple  distance  measure. 

For  document  images  a  signal  model  which  can  distinguish  signal  from  noise  does 
not  exist.  While  a  document  image  is  a  bit-mapped  image,  the  useful  information 
resides  in  the  symbols  found  in  it.  Therefore  a  model  must  include  the  symbol  repre¬ 
sentation  of  the  document  image.  Here,  we  have  proposed  an  OCR  model  to  determine 
a  distortion  measure  by  performing  lossy  representation  to  achieve  a  hxed  entropy. 
Our  analysis  has  shown  that  the  rate-distortion  measure  based  on  the  ability  of  an 
OCR  engine  to  recognize  correctly  is  more  favorable  to  distance  ordering  than  to 
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row  ordering.  Furthermore,  the  results  suggest  an  inherent  hierarchy  within  the  code 
since  the  reverse  ordering  clearly  performs  more  poorly  than  the  forward  ordering. 
The  reason  that  reverse  row  ordering  did  worse  than  forward  row  ordering  is  that  in 
our  segmentation  of  components,  we  referenced  the  upper  left  corner  of  the  bounding 
box,  and  in  our  prototype  estimation  we  padded  the  lower  right  corner.  This  results 
in  an  informational  preference  toward  the  top  of  the  component  rather  than  the  bot¬ 
tom,  which  results  in  reverse  row  ordering  doing  worse  than  forward  ordering.  This 
also  agrees  with  the  entropy  rate  of  row-ordered  residual  coding  computed  in  Chapter 
5,  in  that  the  entropy  is  slightly  reduced  for  higher  entropy  rates,  implying  a  higher 
order  process  within  the  row-ordered  code. 

We  have  also  considered  a  distance-based  distortion  function  in  determining  the 
rate-distortion  characteristics  of  our  coding  method.  The  rate-distortion  tradeoff  can 
be  used  to  automatically  represent  a  lossy  image  in  terms  of  the  lowest  rate  and 
distortion.  Row-ordered  coding  is  not  likely  to  provide  the  lowest  rate  and  distortion 
point  since  for  a  fractional  entropy  rate  there  exists  fractional  distortion  across  all 
entropy  rates.  On  the  other  hand,  the  fractional  measured  distortion  with  respect 
to  a  fractional  entropy  rate  is  clearly  unproportional  across  all  entropy  rates  when 
we  use  the  distance  ordering.  Distance-based  matching  and  coding  may  eventually 
provide  a  measure  of  the  OCRability  of  a  given  document  image  prior  to  performing 
actual  OCR. 
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Chapter  8 


Compressed-domain  processing 


In  the  document  domain,  algorithms  which  implement  CCITT  Group  3  and  4  fax 
compression  have  been  used  widely.  In  the  document  domain,  the  measure  of  a  good 
compression  algorithm  may  have  a  number  of  parameters  beyond  the  traditional  space 
reduction.  For  example,  we  hud  that  digital  libraries  which  contain  document  images 
beneht  not  only  from  compression  which  reduces  the  amount  of  storage  necessary, 
but  also  from  the  ability  to  process  and  search  the  underlying  documents  easily  and 
efficiently. 

Due  to  the  large  volume  of  document  images  and  the  computationally  intensive 
tasks  that  need  to  be  performed  on  them,  there  exists  a  genuine  need  for  performing 
document  image  processing  in  an  efficient  manner.  Compressed-domain  processing 
has  been  able  to  address  this  by  accessing  features  used  by  the  compression  algorithm 
and  using  them  to  satisfy  processing  needs,  as  was  mentioned  in  Chapter  f .  Symbolic 
compression  of  document  images  provides  a  rich  source  of  these  features  and  a  large 
number  of  compressed-domain  processing  tasks  can  be  done  using  them. 

This  chapter  contains  a  discussion  of  a  number  of  applications  that  use  the  rep¬ 
resentations  described  previously.  In  Section  8.1  and  8.2  we  discuss  the  qualities  of 
our  representation  when  we  perform  lossy  compression  and  progressive  transmission. 
In  Section  8.3  we  demonstrate  the  ability  to  perform  scalable  retrieval  for  large  doc¬ 
ument  images.  In  Section  8.4  we  show  how  skew  estimation  and  correction  can  be 
performed  while  the  data  is  in  the  compressed  domain.  In  Section  8.5  we  show  how  to 
perform  keyword  searching,  and  in  Section  8.6  we  consider  the  problem  of  duplicate 
detection  in  an  image  database.  We  conclude  with  hnal  remarks  in  Section  8.7. 

8.1  Lossy  compression 

A  useful  option  in  compressing  an  image  is  to  be  able  to  specify  a  quality  of  com¬ 
pression,  or  a  required  size  limitation.  In  document  image  compression,  it  is  hard  to 
identify  a  measure  of  quality,  which  has  to  be  related  to  the  readability  of  the  docu¬ 
ment.  Most  measures  of  quality  are  related  to  the  overall  “power”  in  the  error  image, 
dehned  by  the  difference  between  the  lossless  and  lossy  representations.  We  have 
found,  however,  that  some  pixels  contribute  more  to  the  readability  of  a  document 
than  others,  and  even  if  two  representations  give  the  same  “error  power”,  one  may 
be  more  readable  than  the  other.  As  described  previously,  in  our  representation  we 
have  ordered  the  residual  pixels  associated  with  each  component  and  can  use  different 
fractions  of  the  residual  map  stream  to  achieve  different  degrees  of  lossy  compression. 
We  take  the  rendition  of  the  document  without  the  use  of  any  residual  pixels  as  a 
baseline  and  add  linearly  increasing  fractions  of  the  residual  stream. 
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Figure  8.1:  Lossy  compression  of  images  using  a)  80%,  b)  60%,  c)  40%,  d)  20%,  and 
e)  1%  of  the  error  streams  associated  with  the  templates. 
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Figure  8.2:  Difference  between  the  original  portion  of  the  image  and  its  lossy  compres¬ 
sion  using  a)  80%,  b)  60%,  c)  40%,  d)  20%,  and  e)  1%  of  the  error  streams  associated 
with  the  templates. 


(e) 


Figure  8.t  shows  an  image  which  has  been  lossily  compressed  to  varying  degrees, 
with  (a)  being  the  least  compressed  and  (e)  being  the  most.  The  percentage  indi¬ 
cates  the  amount  of  residual  information  kept  for  each  non-NULL-clustered  symbol 
(i.e.,  100%  would  yield  a  lossless  image).  Note  that  there  is  no  observable  difference 
between  the  least  and  most  compressed  images  as  regards  readability.  The  residual 
images  shown  in  Figure  8.2  show  that  the  residuals  lie  mostly  on  the  edges  of  the 
symbols,  and  that  even  at  the  highest  levels  of  lossy  compression,  the  rendition  is 
quite  readable  without  any  compromise  in  resolution. 

Figure  8.3  shows  examples  of  lossless,  symbolic-only,  and  lossy  compression  of 
an  image  using  symbolic  compression.  It  is  not  easy  to  see,  but  the  symbolic-only 
representation  contains  an  enormous  amount  of  information  about  the  content  of 
the  image.  Figure  8.4  shows  a  comparison  of  symbolic  compression  with  a  leading 
contender  that  uses  resolution  reduction.  Note  the  reduction  in  correct  recognition 
of  components  within  the  image.  Only  a  small  portion  of  the  image  is  shown  to  allow 
direct  comparison. 

For  122  images  in  the  University  of  Washington  document  database  we  achieved 
an  average  compression  ratio  of  29.56  when  we  used  1%  of  the  residual  stream,  25.71 
for  20%,  22.88  for  40%,  19.96  for  60%,  and  15.67  for  80%.  This  is  in  comparison  to 
lossless  compression  ratios  of  12.2  and  17.8  for  symbolic  compression  and  for  Group 
4  standard  respectively. 
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Figure  8.3:  Example  of  a  document  at  levels  of  symbolic  lossiness.  a)  Lossless  rep¬ 
resentation  taking  t20KB  of  storage  space,  b)  purely  symbolic  representation  taking 
33  KB  of  storage  space,  and  c)  lossy  representation  taking  44  KB  of  storage  space. 

8.2  Progressive  transmission 

In  transmitting  images,  an  ordered  stream  of  information  should  be  used  to  convey 
increasing  amounts  of  information,  so  that  the  image  can  be  rendered  losslessly  if  the 
entire  stream  is  transmitted.  In  transmission  of  textured  regions,  taking  a  blockwise 
DCT  (Discrete  Cosine  Transform)  of  the  image  and  sending  one  coefficient  per  block 
at  each  iteration  provides  a  reasonable  progressive  representation  of  the  image.  In 
the  case  of  document  images  and  the  symbolic  representation,  we  hrst  transmit  all 
the  prototype  and  symbolic  information,  and  we  order  the  residual  maps  based  on 
the  distance  transforms  of  the  prototypes  (or  more  precisely:  of  their  complements). 
Sending  the  residual  maps  in  this  order,  largest  distance  hrst,  renders  the  document 
quite  readable  at  the  start  and  provides  additional  detail  at  each  transmission  itera¬ 
tion. 

We  have  successfully  implemented  a  progressive  transmission  system  based  on 
our  representation.  Figure  8.5  shows  a  progression  for  a  test  image  which  consists 
of  instances  of  similar  symbols;  we  transmit  pixels  that  are  farthest  from  the  proto¬ 
types’  edges  hrst.  Since  larger  components  have  residual  pixels  with  larger  absolute 
distances,  they  are  reconstructed  hrst,  which  is  desirable  since  larger  components  are 
perceptually  more  salient  than  smaller  components. 

8.3  Sub-document  retrieval 

A  common  document  browsing  task  requires  retrieval  and  decoding  of  a  subregion  of 
the  document  image.  This  may  be  useful,  for  example,  for  expanding  a  region  such 
as  a  single  article  or  picture  from  a  thumbnail  of  a  compressed  newspaper  page.  This 
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Figure  8.4:  Comparison  of  lossy  compression  between  symbolic  and  JBIG  compression 
at  almost  constant  levels,  a)  Lossless  representation  of  a  portion;  (b-d)  range  of  lossy 
representations,  from  least  to  most,  using  symbolic  compression;  (e-g)  range  of  lossy 
representations,  from  least  to  most,  using  JBIG  compression. 
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Figure  8.5:  Progressively  sending  bits  ((a)  to  (f));  (f)  is  the  lossless  rendition. 
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Figure  8.6:  A001BIN.TIF  image  at  a)  (700, 700)-(2300, 2000)  b)  (700,700)-(1900,950) 
c)  (830, 1090)-(1280, 1490)  d)  Retrieval  time  as  a  function  of  area. 


can  be  done  using  our  representation  by  partially  decoding  the  appropriate  streams. 
Specifically,  we  decompress  the  prototype  size,  prototype  image,  block  membership, 
component  layout  index,  and  residual  map  index  streams,  and  only  partially  decom¬ 
press  the  largest  streams,  the  component  layout  and  residual  map  streams.  Using 
the  layout  index  stream  we  decompress  components  in  blocks  which  overlap  with  the 
subimage,  and  according  to  the  overlap  of  each  component  with  the  subimage  we 
decompress  its  residual  map.  An  example  is  shown  in  Figure  8.6a.  If  the  coordinates 
of  a  subtitle  are  known,  we  may  decompress  only  the  subtitle  (Figure  8.6b),  and  if 
the  top  left  corner  of  the  hrst  paragraph  is  needed,  we  can  decompress  accordingly 
(Figure  8.6c).  We  can  decompress  only  the  top  part  of  the  hrst  page  in  a  document 
to  determine  the  title  and  authors  of  the  document.  The  processing  time  depends 
on  the  area  of  the  region  of  interest.  Figure  8.6d  shows  a  plot  of  decompression  time 
versus  retrieval  area.  The  time  required  to  decompress  small  regions  can  be  viewed 
as  overhead  time,  and  the  increase  in  time  with  area  dehnes  the  scalability. 

8.4  Skew  estimation  and  correction 

Skew  estimation  and  correction  are  also  important  tasks  in  an  OCR  system.  By 
using  an  algorithm  based  on  the  Hough  transform,  we  are  able  to  estimate  skew  and 
correct  it  using  our  representation,  without  having  to  fully  decompress.  We  use  only 
the  position  and  size  of  each  component.  To  compute  their  values,  we  decompress 
the  prototype  size,  block  membership,  and  component  layout  streams.  We  input  the 
coordinates  of  the  middle  of  the  bottom  of  each  component  to  a  Hough  transform, 
and  thus  compute  the  skew.  For  the  f22  images  in  our  database,  it  took  an  average 
of  2.5  seconds  and  9f52  bytes  per  image  to  calculate  skew  to  an  accuracy  of  f/640 
vertical  units  per  horizontal  unit.  On  the  same  set  of  test  documents  the  average 
error  was  measured  to  be  0.f786  degrees. 

For  skew  correction,  we  modihed  the  component  layout  and  residual  map  streams 
and  reordered  them  if  the  components  moved  from  one  block  to  another.  Reordering 
is  computationally  expensive  and  does  not  contribute  much  correction  for  small  skew 
angles.  To  improve  performance,  we  bounded  the  movement  of  the  components  across 
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2.  Approach 

Text-intensive  docunienC  Images  typically  have  a 
great  deni  of  redundancy  in  the  hitman  representations 
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by  encoding  sets  of  similar  symbols  once,  and  repre¬ 
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Figure  8.7i  Deskewing  example!  a)  Skew  at  five  degrees,  b)  deskew. 
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Figure  8.8i  Small  portion  of  the  deskewing  example!  a)  Skew  at  five  degrees,  b) 
deskew. 

a  block  boundary  to  five  pixels;  below  this  bound,  movement  is  terminated  at  the 
edge  of  the  block.  For  instance,  if  a  component  is  located  close  to  a  block  boundary, 
and  by  deskewing,  it  moves  two  pixels  into  another  block,  it  is  moved  to  the  edge 
of  its  original  block.  We  take  the  middle  of  the  image  as  the  reference  point.  The 
translation  in  the  vertical  direction  is  computed  by  multiplying  the  horizontal  distance 
from  the  middle  of  the  page  by  the  slope  of  the  line  containing  the  reference  point 
and  component  location. 

Figure  8.7  shows  two  examples  of  skewed  documents  and  the  results  of  deskewing 
them.  For  the  122  test  images,  it  took  an  average  of  3.3  seconds  to  deskew  an  image. 
Figure  8.8  shows  two  small  examples.  Note  that  the  local  skew  is  not  corrected. 

8.5  Keyword  search 

Keyword  search  is  another  important  document-related  task.  A  method  reported  by 
Spitz  [98]  requires  scan-line  ordering  of  components  and  matching  of  ordered  compo¬ 
nents  to  queried  components.  To  implement  this  method  using  our  representation, 
we  decompress  the  prototype  size,  prototype  image,  block  membership  and  compo¬ 
nent  layout  streams.  The  skew  angle  of  the  document  is  first  estimated,  as  described 
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above,  and  the  bounding  boxes  are  horizontally  projected.  The  projection  is  then 
scanned  to  determine  the  locations  of  lines  of  text.  We  scan  the  lines  from  top  to 
bottom  and  record  the  components  which  each  line  contains.  This  method  picks  up 
disjoint  components,  such  as  the  dots  in  the  ‘i’  and  ‘j’,  very  nicely,  ft  is  also  very 
stable  to  errors  in  locations  of  objects  and  in  scan  direction. 

Once  the  components  are  ordered  we  apply  a  feature-based  matching  algorithm. 
We  classify  each  component  as  being  an  x-height,  an  ascender,  a  descender,  an  “i”, 
a  “j”,  or  punctuation.  A  second- level  classihcation  checks  if  there  exists  a  hole  in 
the  component,  and  a  third  checks  for  the  existence  of  a  concavity  from  the  right. 
These  features  are  computed  only  for  the  prototypes  rather  than  for  the  individual 
components.  The  heights  of  the  components  are  measured  only  for  the  hrst  ten  in¬ 
stances  of  each  prototype  in  the  image,  and  their  values  are  averaged.  These  features 
are  shown  in  Figure  8.9.  The  input  query  is  mapped  into  this  feature  space  using  a 
simple  lookup  table;  90%  of  the  maximum  score  is  taken  to  be  a  match.  Figure  8.f0 
shows  search  results  on  the  122  database  images  for  the  query  “approach”,  ft  took 
an  average  of  2.7  seconds  per  image  to  obtain  the  results.  There  are  fundamental 
ambiguities  associated  with  using  our  small  set  of  features;  for  example,  we  are  treat¬ 
ing  characters  like  “a”  and  “o”  as  belonging  to  the  same  feature  class.  However,  the 
effects  of  these  ambiguities  are  greatly  reduced  when  a  string  of  features  is  used  in 
matching. 

8.6  Duplicate  detection 

Consider  a  situation  where  thousands  of  documents  are  being  imaged  and  added  to  a 
database,  possibly  from  a  distributed  environment.  If  multiple  instances  of  the  same 
document  exist,  they  may  be  re-entered  into  the  database  unnecessarily.  This  may  not 
be  desirable  for  a  number  of  reasons,  including  increased  storage  cost,  difficulties  in 
maintaining  database  integrity,  increased  processing  cost  for  database  operations,  and 
cost  of  indexing  multiple  images  with  the  same  underlying  content.  The  dehnition  of 
a  “duplicate  instance”  is  open  to  some  interpretation  and  we  only  consider  duplicates 
where  multiple  instances  of  an  effectively  identical  original  source  are  scanned  for 
incorporation  into  a  database.  The  original  documents  may  have  been  written  on, 
stapled,  torn,  taped  or  may  have  pages  missing  or  had  a  cover  added.  The  document 
may  have  been  copied  repeatedly,  so  different-generation  copies  are  involved.  The 
document  may  have  been  scanned  at  different  times  and  on  different  devices,  so 
resolution,  illumination,  and  contrast  are  also  issues. 

One  approach  to  the  problem  of  duplicate  detection  is  to  address  it  from  the 
symbolic  point  of  view.  In  a  database  that  has  organized  the  image  information  based 
on  the  symbolic  representation,  it  is  possible  to  analyze  the  prototype  image  maps 
and  their  distribution  throughout  the  image.  This  distribution  can  be  compared  to 
the  distribution  of  another  image  by  associating  similar  prototypes.  Figure  8.11  shows 
two  prototype  image  maps  belonging  to  different  document  images.  The  similarities 
and  dissimilarities  of  the  prototypes  are  obvious.  To  create  a  correspondence  between 
them,  a  base  set  of  prototypes  is  created  which  is  common  to  the  two  images.  This 
base  set  also  takes  into  account  the  similarities  between  prototypes  in  the  same  image. 
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Figure  8.9:  Shape  code  description  for  use  in  keyword  searching. 
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Figure  8.10:  Match  results  for  string  “approach”:  (a)  from  A03L,  (b)  from  A03P, 
(c)-(d)  from  A053,  (e)  from  A054,  (f)-(j)  from  A061. 
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of  two  prototype  sets. 


The  resulting  base  set  (a  smaller  set  than  the  prototype  sets  of  both  images)  is  then 
used  to  determine  the  distributions  of  the  components.  The  distributions  of  these 
base  prototypes  are  then  compared  for  a  possible  match. 

Figure  8.t2  shows  the  distributions  in  the  images  of  the  prototypes  that  are  shown 
in  Figures  8.  ft  (a)  (hrst  image)  and  (b)  (second  image).  Figure  8.t2a  shows  the  dis¬ 
tribution  with  respect  to  a  prototype  base  set  derived  primarily  from  the  hrst  image 
and  Figure  8.t2b  is  the  distribution  of  the  second  image  with  respect  to  the  same 
prototype  set.  Figures  8.t2c  and  d  are  the  distributions  with  respect  to  a  proto¬ 
type  base  set  derived  from  the  second  image.  There  exists  some  level  of  discrepancy 
between  the  distributions  which  might  be  used  for  discrimination.  It  only  remains 
to  determine  the  level  of  similarity  after  some  degradation  corresponding  to  realistic 
duplication. 

Figure  8.t3  shows  distributions  for  an  image  before  and  after  various  amounts  of 
degradation.  There  exists  a  large  amount  of  similarity  between  the  distributions. 
However,  a  test  of  distribution  similarity  is  very  time  consuming,  rendering  this 
method  unrealistic  for  determining  duplicates  in  databases  containing  millions  of 
images.  This  test  does,  however,  provide  a  good  tool  for  performing  comparisons  at 
a  second  or  third  level  after  greatly  reducing  the  pool  of  suspect  documents. 

8.7  Summary 

Computer  systems  are  being  increasingly  used  to  process  and  analyze  images  of  many 
types.  These  tasks  vary  in  computational  complexity  and  in  storage  and  communi¬ 
cation  requirements.  Advances  in  data  compression  have  been  able  to  reduce  these 
requirements;  a  compressed  hie  takes  less  storage  space,  takes  less  time  to  read  and 
process,  and  takes  less  channel  capacity  to  transmit.  Compression  is  especially  im¬ 
portant  in  handling  document  images  because  of  the  large  hie  sizes  and  the  intensive 
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Figure  8.12:  Histograms  of  components  assigned  to  base  prototype  image  sets.  (a,b) 
and  (c,d)  have  common  prototype  bases  but  the  histograms  are  obtained  from  different 
images. 
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Figure  8.13:  Histograms  of  components  assigned  to  base  prototype  image  sets  for 
an  image  that  has  undergone  various  levels  of  degradation.  The  (a,b)  pair  has  the 
lowest  amount  of  degradation  and  the  (e,f)  pair  has  the  highest  amount  of  degrada¬ 
tion.  (a,c,e)  are  the  original  image  histograms  and  (b,d,f)  are  the  degraded  image 
histograms. 
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processing  they  require.  With  the  increasing  popularity  of  digital  libraries,  retaining 
and  presenting  scanned  documents  in  image  form  remains  a  cost-effective  and  usually 
accurate  way  to  distribute  document  information  that  was  originally  in  hard  copy. 

We  have  developed  a  system  that  addresses  most  aspects  of  document  image 
compression,  especially  those  related  to  transmission  and  processing.  Our  system 
achieves  compression  while  still  allowing  for  efficient  processing,  ft  is  evident  that 
the  important  information  in  documents  is  at  the  symbol  level  which  implies  that 
traditional  resolution-reduction  methods  may  alter  symbol  shapes  and  reduce  per¬ 
formance  in  document  processing  tasks.  By  symbol  coding,  derived  in  part  from 
pattern-matching  and  substitution  algorithms,  we  have  shown  that  image  coding  and 
retrieval  can  be  done  efficiently.  Image  analysis  tasks  such  as  skew  estimation,  cor¬ 
rection,  keyword  searching,  and  duplicate  detection  can  also  be  performed  efficiently 
since  the  symbolic  representation  preserves  sufficient  information  to  be  used  in  these 
tasks. 

In  this  chapter  we  have  addressed  the  problem  of  compression  strictly  in  the  im¬ 
age  domain,  and  for  good  reason.  It  is  clear  that  the  ultimate  form  of  symbolic 
compression  is  simply  to  recognize  the  symbols  and  represent  them  by  their  ASCII 
codes.  However,  such  a  compressed  representation  suffers  by  not  preserving  infor¬ 
mation  about  the  fonts,  point  sizes,  locations  on  the  page,  etc.  of  the  symbols.  To 
varying  degrees  such  features  can  be  preserved,  but  the  representation  does  not  allow 
us  to  reconstruct  a  truly  lossless  version  of  the  original.  A  hybrid  compression  scheme 
which  integrates  ASCII  symbols  which  can  be  recognized  with  high  accuracy  into  a 
scheme  such  as  ours  would  be  ideal. 
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Chapter  9 


Applications  and  extensions 


A  number  of  applications  and  enhancements  can  be  built  into  our  current  compression 
system  or  can  benefit  from  its  structure.  In  our  base  system  we  were  able  to  create 
an  infrastructure  that  can  be  used  for  compression,  transmission,  and  processing. 

An  application  that  was  considered  earlier  for  compressed-domain  processing  was 
document  duplicate  detection.  While  it  was  not  able  to  perform  fast  enough  to  handle 
millions  of  documents,  it  can  be  modified  to  handle  such  computational  demands.  One 
source  of  overhead  is  the  number  of  prototype  images.  If  we  can  map  the  prototypes 
into  a  fixed  number  for  all  the  documents,  this  will  reduce  computation  dramatically. 
Another  source  of  computational  reduction  is  in  the  number  of  components.  We 
can  consider  limiting  the  number  of  components,  at  the  cost  of  increasing  the  false 
detection  rate.  However,  preserving  the  order  of  the  components  as  they  appear  on 
a  scan  line  in  the  image  can  help  to  dramatically  reduce  the  false  detection  rate. 

A  possible  extension  of  our  base  compression  method  is  to  grayscale  document 
images.  Since  documents  may  contain  images,  it  may  be  unrealistic  to  scan  than  as 
binary  images.  For  multi-level  documents,  many  essential  document  image  charac¬ 
teristics  can  be  preserved  in  grayscale.  For  such  images,  resolution  reduction  is  still 
not  desirable,  but  grayscale  re-quantization  is  an  option  for  lossy  representation. 

Graphic  images  can  also  benefit  from  symbolic  compression.  A  graphic  image 
usually  contains  lines,  arrows,  legends,  and  a  small  amount  of  textual  information. 
There  are  very  few  of  these  constituent  patterns  and  there  is  hope  of  achieving  high 
levels  of  compression. 

Networked  databases  could  also  benefit  from  the  hierarchical  data  representation 
of  our  base  compression  scheme  and  from  enhancements  to  the  residual  coding.  Some 
problems  which  could  be  addressed  are  source- channel  coding,  design  of  variable 
error  resiliency  codes,  client-server  task  optimization,  packet,  TCP/IP,  and  ATM 
transmission  optimization. 

Finally,  we  have  experimented  briefly  with  extensions  to  different  media  types, 
including  textured  images,  video,  and  sound.  These  media  domains  already  have 
potential  for  developing  a  hierarchical  model.  The  extension  considered  here  is  the 
intelligent  merging  of  media  types  so  that  problems  of  compression,  transmission, 
and  processing  are  addressed  across  media  types.  A  hyper-document  [92]  serves  the 
purpose  of  integrating  multiple  media  types,  so  that  compression,  transmission,  and 
processing  of  hyperdocuments  can  be  considered. 

In  this  chapter,  we  present  preliminary  results  in  each  of  these  areas.  In  Section  9.f 
we  provide  an  alternate  approach  to  duplicate  document  detection,  in  addition  to  that 
suggested  in  Chapter  8,  which  can  process  millions  of  documents  on  user  demand.  In 
Section  9.2  we  present  some  preliminary  results  regarding  the  compression  of  grayscale 


document  images.  In  Section  9.3  we  give  an  argument  for  using  the  same  compression 
methodology  for  graphic  images.  In  Section  9.4  we  propose  a  way  to  address  the 
communication  of  document  images  in  a  networked  environment  along  with  related 
problems  and  advantages.  In  Section  9.5  we  extend  the  classical  document  function 
to  include  other  media  types  and  ht  these  media  sources  within  the  framework  of  our 
symbolic  compression.  We  conclude  in  Section  9.6  with  some  hnal  remarks. 

9.1  Duplicate  document  detection 

Depending  on  what  information  is  available  a  priori  in  a  database  system,  the  prob¬ 
lem  of  duplicate  detection  can  be  approached  in  a  number  of  ways.  If,  for  example, 
basic  index  information  such  as  the  document  number,  date,  title,  authors  or  number 
of  pages  is  entered  manually  prior  to  scanning,  this  information  can  serve  as  a  prelim¬ 
inary  hlter  for  duplicates.  In  most  cases,  however,  high-volume  operations  prohibit 
such  manual  entry  prior  to  scanning.  Instead,  we  would  like  to  identify  duplicates 
from  their  images  prior  to  any  manual  entry.  Although  we  have  basic  quantitative 
information  such  as  the  number  of  pages,  at  this  point  we  consider  only  the  analysis 
of  the  image  itself. 

One  possible  solution  which  has  been  proposed  is  to  apply  OCR  to  the  document 
image  and  match  as  much  text  as  possible  between  the  documents.  Although  this 
matching  can  be  done  relatively  quickly,  OCR  performance  suffers  on  degraded  doc¬ 
uments  in  terms  of  both  accuracy  and  speed.  For  this  reason,  we  do  not  feel  that 
OCR  is  feasible  as  a  hrst-level  hlter,  but  it  may  be  used  as  a  secondary  hlter  to  reduce 
possible  matches  from  hundreds  to  tens  of  documents. 

A  more  realistic  solution  (in  terms  of  resources  and  time  requirements)  would 
require  an  approach  that  does  not  process  every  image  every  time  duplicate  detection 
is  initiated.  This  is  logical  in  the  sense  that  for  each  detection  procedure  all  images 
in  the  corpus  need  to  be  processed  redundantly.  If  an  algorithm  can  produce  features 
that  need  only  be  calculated  once,  duplication  detection  could  beneht  from  this.  The 
detection  process  would  not  be  directly  dependent  on  the  presented  image  but  only 
on  the  set  of  features  derived  from  it.  To  consider  this  approach  we  need  to  determine 
a  signature  that  describes  the  presented  image  accurately  and  is  consistent  across  its 
duplicate  instances,  but  that  also  provides  sufhcient  separability  to  distinguish  it  from 
non-duplicates.  Then  we  can  address  concerns  about  determining  this  signature  by 
performing  compressed-domain  processing  of  the  image  in  symbolic  (or  other)  form. 
(Hull  [40]  operated  on  CCITT  G4  images  to  detect  duplicates.)  A  large  number  of 
features  can  be  used  to  dehne  the  signature  to  be  used  for  duplicate  detection.  We 
are  constrained  by  the  fact  that  we  may  be  dealing  with  millions  of  documents,  many 
of  which  may  be  highly  degraded.  To  cope  with  such  situations,  we  must  have  a 
signature  for  each  document  which  is 

Robust  -  The  signature  should  be  reliably  extractable,  even  when  the  document 
becomes  degraded. 

Unique  -  Although  we  cannot  realistically  expect  the  signatures  to  be  unique  unless 
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we  use  an  excessively  large  feature  set,  a  given  signature  should  be  associated 
with  several  tens  of  documents  at  most. 

Compact  -  The  storage  capacity  required  to  hold  the  signatures  of  millions  of  ob¬ 
jects  may  be  very  large,  so  the  index  keys  should  be  as  small  as  possible. 

In  addition,  the  algorithms  which  extract  the  signature  must  be 

Fast  -  Algorithms  which  take  minutes  to  extract  a  signature,  and  then  attempt  to 
match  it  against  each  document  in  the  database,  are  not  acceptable.  The  appli¬ 
cation  demands  rapid  extraction,  and  near  constant  time  indexing  of  previously 
entered  documents 

Scalable  -  Initially  the  algorithms  will  work  on  hundreds  of  document,  but  as  more 
documents  are  processed,  the  size  of  the  database  could  grow  to  tens  of  millions. 

Accurate  -  It  is  acceptable  to  miss  a  small  percentage  of  duplicates  since  the  result 
is  only  that  the  same  document  is  entered  twice,  but  identifying  documents  as 
duplicates  when  they  are  not  (false  alarms)  is  not  acceptable. 

9.1.1  Related  work 

The  detection  of  duplicate  or  near-duplicate  documents  has  been  a  problem  of  interest 
for  some  time  in  many  helds,  but  has  not  been  addressed  for  collections  of  images. 
Some  example  domains  include  education,  for  detection  of  plagiarism  [78];  publishing, 
for  detection  of  unauthorized  copies  [89,  96];  databases,  for  maintaining  database 
integrity;  information  retrieval,  for  information  hltering  [ff5];  and  in  the  USENIX 
community,  for  detecting  duplicate  hies  [67]. 

In  the  document  community,  most  of  the  work  on  identifying  similar  documents 
has  been  done  using  either  ASCII  documents  or  “water-marked”  electronic  represen¬ 
tations.  Much  less  work  has  been  done  with  document  images.  A  notable  exception  is 
Hull  [39],  who  describes  a  method  for  matching  documents  which  have  the  same  char¬ 
acter  content  but  which  may  have  been  reformatted  or  distorted  prior  to  re-imaging 
(content-variant  documents).  Hull’s  approach  represents  each  document  by  a  set  of 
robust  local  features  which  can  be  used  to  hash  into  a  database  of  descriptors.  The 
features  in  both  the  query  example  and  the  database  must  be  invariant  to  geometric 
distortions;  by  extracting  multiple  descriptors  from  each  document,  they  can  also 
be  made  robust  to  errors  in  feature  extraction.  The  measure  of  similarity  is  simply 
the  number  of  features  the  query  document  and  the  database  instance  have  in  com¬ 
mon.  Experiments  were  performed  using  the  symbolic  approach  where  distributions 
of  components  assigned  to  a  common  set  of  prototypes  were  examined.  However, 
this  process  relied  heavily  on  the  assumption  that  the  images  should  be  represented 
in  symbolic  format;  this  may  not  provide  an  efficient  approach.  Another  experiment 
was  performed  using  as  features  the  character  counts  for  each  word  in  short  sequences 
of  words;  this  provided  a  set  of  simple  yet  robust  features  that  was  adequate  for  small 
databases.  With  as  few  as  ten  features,  f  00%  accuracy  was  obtained  for  a  clean  query 
string  and  a  small  clean  database.  Unfortunately,  as  the  number  of  documents  grows, 
the  character  count  metric  becomes  less  discriminatory. 
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Figure  9.1:  Sample  character  shape  code  assignment. 


9.1.2  Basic  approach 

Another  approach  is  based  on  the  extraction  of  a  signature  from  a  representative  line 
of  text  in  the  document  image  using  a  shape  coding  technique.  The  technique  has  been 
used  by  a  number  of  authors  including  Tanaka  [103]  and  Spitz  [99]  for  other  document 
analysis  applications.  Shape  coding  labels  the  symbols  in  the  line  of  text  based  on 
very  simple  shape  properties,  such  as  whether  they  are  ascenders,  descenders,  limited 
to  the  X-line,  multi-component,  or  punctuation,  for  example.  These  properties  are 
much  more  robust  to  noise  than  the  features  necessary  for  OCR,  and  can  be  extracted 
fairly  rapidly. 

To  extract  the  signature,  the  document  is  scanned  for  a  representative  sample  of 
text,  typically  on  the  order  of  50  symbols,  on  a  single  line  or  across  several  lines.  For 
this  sample,  the  base-line,  x-line,  ascender-line  and  descender-line  are  identihed,  and 
each  character  component  is  assigned  a  shape  code  as  shown  in  Figure  9.1. 

The  string  of  shape  codes  assigned  to  the  characters  of  the  text  sample  is  used  as  a 
signature  for  the  document.  A  second  level  of  robustness  is  added  by  indexing  based 
on  n-grams  of  the  codes,  rather  than  attempting  to  use  an  index  based  on  the  entire 
string.  Fach  shape  code  n-gram  serves  as  an  index  key  into  the  database.  A  single 
dropped  or  inserted  code  will  affect  a  small  number  n  of  these  keys  but  will  not  affect 
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Figure  9.2:  Overview  of  indexing  scheme. 


the  entire  signature.  Figure  9.2  shows  the  relationship  between  the  signature,  its 
keys,  and  the  database.  When  a  set  of  keys  is  presented  for  indexing,  each  key  results 
in  a  set  of  hits  from  the  database.  Fach  hit  is  counted  as  a  vote  for  the  resulting 
document,  and  a  ranked  set  can  be  returned. 

Clearly,  a  number  of  additional  issues  should  be  addressed  in  developing  a  com¬ 
plete  system  that  satishes  the  criteria  set  forth  above.  These  include: 

•  use  of  global  classihers  -  number  of  pages,  page  component  statistics,  etc.  as 
hrst-level  biters  to  reduce  the  duplicate  search  space. 

•  choice  of  a  shape  code  alphabet  -  selection  of  features  to  incorporate  into  the 
signature  which  provide  maximum  discrimination. 

•  extraction  of  features  -  how  to  select  the  signature  in  the  image  of  a  document. 

•  database  organization  and  indexing  -  how  to  create  efbcient  ways  to  index  into 
large  collections. 

•  veribcation  of  candidate  duplicates. 

All  of  these  issues  are  addressed  while  designing  a  working  system. 
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9.1.3  Feasibility  analysis 

Before  implementing  and  testing  this  approach  on  real  data,  we  performed  a  theoreti¬ 
cal  analysis  to  see  if  our  approach  is  realistic  and  if  it  is  robust  to  errors  in  signature  ex¬ 
traction.  The  parameters  that  were  varied  in  our  analysis  included  system-dependent 
variables  such  as  hie  size  limitations  of  the  operating  system,  disk  access  time  and 
disk  transfer  rate;  database  variables  such  as  the  number  of  documents,  the  size  of 
the  index  table  and  the  average  size  of  the  documents;  and  algorithm  variables  such 
as  the  size  of  the  signature  alphabet,  the  size  of  the  signature  and  the  key  or  n-gram 
size. 

The  analysis  yielded  qualitative  estimates  of  the  expected  size  of  the  database, 
the  computational  requirements  for  matching  signatures,  and  the  number  of  missed 
and  false  duplicate  detections  as  functions  of  the  database  size.  It  was  found  that  the 
system  could  be  implemented  with  generally  available  hardware. 

The  analysis  was  performed  assuming  we  have  extracted  a  candidate  signature. 
The  following  is  a  list  of  variables  which  can  be  used  in  the  feasibility  and  perfor¬ 
mance  analysis  of  the  algorithm.  Here  the  signature  is  the  vector  of  feature  values 
used  to  represent  the  document  and  a  key  is  a  term,  possibly  resulting  from  a  par¬ 
titioning  of  the  document  signature,  used  to  index  into  the  database.  A  majority  of 
the  parameters  reflect  physical  constraints  of  the  system  and  are  necessary  to  explore 
scalability. 

Algorithm-Dependent  Variables 

N  :  Number  of  documents. 

Nj  :  Maximum  number  of  tiles  for  the  index  table. 

S f  :  Maximum  tile  size  for  storing  b  buckets. 

E  :  Size  of  index  table  entries  in  bytes. 
d  :  Average  size  of  documents  in  bytes. 

/  :  Number  of  buckets  that  will  be  kept  in  the  main  memory. 
b  :  Number  of  index  table  buckets  that  will  be  stored  in  a  tile  in  main  memory. 


Independent  Variables 

a  :  Size  of  alphabet  used  to  construct  the  signature. 
m  :  Size  of  the  signature. 
w  :  Window  size. 

k  :  Number  of  keys  for  a  given  signature  size  (m  —  re  +  t). 

The  matching  algorithm  relies  on  the  index  structure  that  is  generated  as  doc¬ 
uments  are  added  to  the  database.  Each  signature  is  partitioned  into  equal-sized 
overlapping  windows  of  size  re,  to  be  used  as  index  keys.  This  partitioning  provides 
robustness  in  the  sense  that  errors  in  the  signature  will  only  be  propagated  within  the 
window,  at  the  expense  of  a  less  unique  set  of  sub-keys.  For  a  signature  of  size  m  we 
have  k  =  (m  —  re  +  t)  possible  keys  (Figure  9.2)  to  be  indexed.  The  index  table  has 
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Bucket 


Figure  9.3:  The  index  table. 

on  the  average  k  X  (^)  entries  per  bncket  with  each  entry  being  the  identihcation  of 
a  docnment  which  contains  that  key.  Fignre  9.3  shows  the  index  table  strnctnre.  A 
table  with  a  maximnm  size  of  MinjfV,  X  (^))}  keeps  a  connt  of  the  docnments 
that  have  matched  the  inpnt  key.  When  the  document’s  keys  are  indexed,  a  connter 
is  incremented  for  each  docnment  which  contains  that  key.  The  most  freqnently  oc- 
cnrring  docnments  will  then  be  the  candidate  dnplicates.  A  more  in-depth  discnssion 
is  contained  in  [23]. 

9.1.4  Performance  analysis 

To  characterize  the  performance  of  a  specihed  system  we  hrst  need  to  formnlate 
several  probabilistic  models  and  dehne  a  design  criterion.  This  is  done  by  nsing 
hypothesis  testing  and  deriving  an  appropriate  distribntion  fnnction  snch  that  specihc 
performance  measnres  can  be  compnted.  For  a  hypothesis  test  consider  two  events, 
called  the  nnll  and  alternative  events.  The  nnll  event,  ifo,  signihes  a  sitnation  where 
there  is  no  dnplicate  for  a  given  docnment,  and  the  alternative  event,  ifi,  signihes 
that  there  is  a  dnplicate. 

Ho  :  Nnll  Hypothesis  P{j]\Ho)  (9-1) 

H\  :  Alternative  Hypothesis  P(j]\H\) 

Associated  with  each  hypothesis  is  a  probability  measnre  which  characterizes  the 
variable  nsed  for  detection,  in  this  case  rj.  In  detection,  a  thresholding  operation 
is  nsed  to  determine  which  hypothesis  is  valid.  In  this  case  ry  signihes  a  matching 
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score  assigned  to  an  observation.  By  choosing  an  appropriate  threshold  t]t  we  make 
a  decision  that  ii  rj  >  t]t  then  accept  the  alternative  hypothesis  and  ii  rj  <  t]t  then 
accept  the  nnll  hypothesis.  The  probabilities  of  detection  and  of  false  alarm  effectively 
captnre  the  performance  of  this  detector: 

Pd^Ut)  =  P{v  >  (9.2) 

Pfa{vt)  =  P{v  >  Vt\Ho) 

This  set  of  nnmbers  will  be  nsed  as  operating  specihcations  given  varions  parameters, 
bnt  to  fnlly  specify  Pd  and  PpA  we  need  to  analyze  or  make  assnmptions  abont  the 
available  data  and  the  matching  processes. 

Given  a  signatnre  of  size  m  from  an  alphabet  of  size  a  and  an  observation  window 
of  size  re,  dehne  Xi  to  be  the  iih  symbol  and  yj  to  be  the  jth  index  key.  With 
Xi  G  Xi  —  {0, 1}^°S2(“)  it  is  easy  to  see  that  all  the  Xj’s  are  independent.  In  fact 

m  m 

P[X  =  x)  =  n  P{X^  =  Xi)  for  X  G  X  =  [J  =  {0,  f  (9.3) 

i=\  i=\ 

Let  ns  dehne  yi  —  {xj,  Xj+i, ...,  this  is  clearly  dependent  on  to  As 

described  in  the  previons  sections,  the  y^’s  will  be  nsed  for  indexing  into  a  signatnre 
database  and  the  nnmber  of  hits  determines  rj.  In  indexing,  however,  there  is  no 
preference  as  to  the  order  of  the  observed  yi  valnes,  and  in  a  hypothetical  sitnation 
two  doenments  conld  have  large  nnmbers  of  hits  by  matching  y^’s  ont  of  order.  Smaller 
w  valnes  wonld  increase  this  effect  and  larger  valnes  wonld  decrease  it;  however,  in 
most  cases  we  can  argne  that  the  yi  valnes  nsed  to  calcnlate  y  are  independent.  We 
shall  assnme  that  it  is  eqnally  probable  (with  probability  to  observe  any  yi  and 
shall  now  present  onr  probability  analysis. 

Under  the  conditions  stated  above,  given  a  signatnre  we  observe  a  set  of  featnres 
yi  for  i  —  f ,  2, ...,  &.  Since  each  yi  is  now  treated  as  an  independent  observation  and 
we  have  k  observations,  we  can  identify  the  probability  of  y  matches  ont  of  k  possible 
matches  to  be  binomial,  &(&,  -X)- 


ffere  P{y)  is  the  probability  of  the  nnmber  of  matches  given  the  nnll  hypothesis, 
P[y\p[o).  This  probability  in  effect  measnres  the  relationship  between  different  sig- 
natnres  and  is  relatively  easy  to  nnderstand  and  compnte.  The  assnmption  that  the 
occnrrences  of  keys  are  eqnally  likely  in  the  realm  of  all  allowable  keys  is  nnrealistic. 
For  example,  it  is  nnrealistic  for  a  window  observation  to  consist  of  all  ascenders. 
We  therefore  intend  to  derive  realistic  probabilities  of  key  occnrrences  from  experi¬ 
ments;  this  may  revise  the  probability  distribntion  fnnetion  of  the  nnll  hypothesis. 
The  probability  measnre  for  the  alternative  hypothesis,  however,  is  more  involved. 

To  derive  the  probability  distribntion  of  the  alternative  hypothesis,  we  need  to 
consider  what  realistic  errors  may  be  fonnd  between  a  candidate  signatnre  and  its  cor¬ 
responding  entry  in  the  database.  The  discrepancies  between  observed  and  recorded 


95 


signatures  are  in  the  form  of  insertions,  deletions,  or  changes  of  shape  code.  In  gen¬ 
eral  it  is  more  probable  to  have  fewer  errors  than  to  have  more,  so  we  assume  that 
the  probability  of  observing  i  errors  is  a  decaying  exponential  function,  dehned  by 
a  decay  rate  /3  .  The  general  form  of  this  function  is  P{i)  =  P(0)e“^*,  where  P(0) 
should  be  calculated  by  letting  J27LoP{i)  =  1-  Calculating  P(0)  and  formulating  the 
distribution  function  we  get 


P{i)  =  P(0)e^*  =  e  for  z  =  0,  f , m.  (9.5) 

This  equation  is  not  exact  since  this  probability  measure  depends  on  the  accuracy  of 
identifying  shape  codes,  the  statistics  of  shape  codes,  and  the  similarity  of  duplicates 
found  in  the  database,  ft  is,  however,  impossible  to  characterize  the  statistical  nature 
of  these  phenomena,  so  we  have  simplihed  P{i)  to  an  exponential  function.  In  order 
to  formulate  the  probability  distribution  of  the  alternate  hypothesis  we  need  to  study 
the  relationship  between  the  number  of  errors  and  the  matching  score.  In  the  case 
where  only  one  error  occurs,  the  error  is  propagated  to  w  out  of  k  index  entries. 
This  results  in  a  matching  score  oi  k  —  w  out  of  a  maximum  of  k.  Although  the 
errors  might  occur  at  either  end  of  the  signature,  and  insertions  and  deletions  change 
the  matching  score  (by  1)  in  contrast  with  shape  changes,  these  occurrences  are 
statistically  insignihcant  and  offset  each  other.  For  multiple  errors,  the  worst-case 
scenario  is  the  case  where  the  errors  occur  w  shapes  away  from  each  other  and  these 
errors  propagate  to  the  greatest  possible  number  of  allowable  window  observations. 
For  this  worst-case  scenario,  i  errors  will  translate  to  a  maximum  matching  score  of 


k-iw  for  z  =  0,  f, ..., 
0  otherwise 


(9.6) 


For  the  best-case  scenario,  the  errors  could  occur  next  to  each  other,  yielding  a 
maximum  matching  score  of 


'^best 


k- 


W 


k  —  i  for  i  =  0,  f , ...,  &. 


(9.7) 


The  probability  that  worst-case  or  best-case  errors  occur  is  clearly  dependent  on  the 
number  of  errors.  This  is  true  since  z  =  f,  by  dehnition,  is  a  worst-case  scenario 
and  i  =  m  it  is  a  best-case  scenario,  ft  is  also  true  that  the  score  probabilities 
change  as  a  power  function  with  respect  to  the  number  of  errors,  and  their  analysis 
is  extremely  complex.  Therefore,  for  the  sake  of  simplihcation,  and  as  will  be  seen 
in  the  following  section  we  will  assume  that  the  matching  score  is  an  average  of  the 
worst-  and  best-case  scenarios: 


ri{i) 


“^worst  T  "^best 
2 


^  _  i+iXw 


for  i  =  0,f,..., 


for  i  = 


2k 

ti;  +  l 


+  f,...,m  —  rc  +  i! 


(9.8) 


Using  (9.5)  and  (9.8),  the  probability  distribution  function  of  the  alternative  hypoth¬ 
esis  is  fully  specihed.  We  are  now  ready  to  calculate  the  probabilities  of  detection 
and  false  alarm. 
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Given  a  threshold  tjt-,  we  make  a  decision  that  any  document  with  a  higher  score 
is  identihed  as  a  dnplicate,  as  shown  in  (9.2).  We  now  have  to  apply  this  criterion  to 
the  probability  distribntion  fnnctions  of  the  nnll  and  alternative  hypothesis  to  get  the 
probability  of  false  alarm  and  probability  of  detection,  respectively.  The  distribntion 
fnnction  of  the  nnll  hypothesis  is  given  in  (9.4),  and  snmmation  over  ry  =  Tyx,...,^ 
gives  the  probability  of  false  alarm: 

=  P„>, ,,//„)=  e(J)(;1)'(i (9.9, 


In  order  to  calcnlate  the  probability  of  detection  we  need  to  consider  the  inverse  of 
(9.8)  to  be  nsed  in  (9.5).  Using  t]t  and  solving  for  i  in  (9.8)  gives  the  maximnm 
nnmber  of  allowable  errors  snch  that  the  effective  score  is  the  threshold  score: 


i 


max 


2(k-7]T) 

1+ti; 

k  -rjT 


for 

otherwise 


(9.10) 


Then  snmming  (9.5)  from  zero  to  imax  gives  the  probability  of  detection: 


imax  /  1  _  P~^  \ 

Pd  =  P(v  >  („,.,)  (9.11) 

Given  eqnations  (9.9)  and  (9.11),  it  is  possible  to  plot  the  probability  of  false  alarm 
with  respect  to  the  probability  of  detection  as  a  fnnction  of  the  threshold  t]t-  Tra¬ 
ditionally,  this  plot  is  called  the  ROC  (Receiver  Operating  Characteristic)  cnrve;  a 
typical  cnrve  is  shown  in  Fignre  9.4.  Fignre  9.4  also  shows  a  diagonal  line  which 
signihes  the  minimum  achievable  performance.  The  diagonal  line  represents  picking 
the  nnll  or  the  alternative  hypothesis  based  on  a  coin  toss;  if  an  algorithm  performs 
below  this  line,  a  coin  toss  wonld  achieve  better  performance. 


9.1.5  System  design 

Onr  model  basically  has  two  main  prohles:  performance  and  system  resonrces.  The 
performance  can  be  characterized  by  specifying  the  error  tolerance.  The  system 
resonrces  can  be  thonght  of  as  constraints  on  the  desired  performance.  It  is  hard 
to  hnd  a  single  formnla  that  can  give  ns  optimnm  valnes  of  the  model  parameters 
(m,  re,  a)  for  given  system  resonrces  and  performance  range.  Bnt  we  can  at  least  give 
a  recipe  for  hnding  “good”  valnes  for  the  parameters.  Once  the  system  resonrces 
are  characterized,  making  tables  (like  Tables  9.1  and  9.2)  for  different  combinations 
of  the  model  parameters  (m,w,a),  and  considering  constraints  for  different  valnes  of 
these  parameters,  helps  ns  nnderstand  the  possible  ranges  of  the  model  parameters. 
Then,  by  drawing  ROC  enrves  and  looking  for  a  desired  detection  probability  range, 
we  can  shrink  the  ranges  of  the  parameters  farther. 

Let  ns  comment  on  each  parameter’s  effect.  The  increase  in  the  valne  of  m  (the  size 
of  the  signatnre)  increases  the  search  time  and  the  size  of  the  index  table.  However, 
the  probability  of  dnplicate  detection  becomes  higher  (Fignre  9.5). 
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a 

Nnmber 

Fntries 

Bncket 

Bnckets 

Bncket  hie 

Nnmber 

Fxpected 

of 

per 

size 

per 

size 

of 

search 

bncket  s 

bncket 

(MB) 

hie 

(MB) 

hies 

time  (sec) 

4 

f024 

2246094 

8.6 

1 

8.6 

1024 

38.68 

5 

3f25 

736000 

2.8 

1 

2.8 

3125 

12.82 

6 

7776 

295782 

f.f 

1 

f.f 

7776 

5.23 

7 

f6807 

f36848 

0.5 

3 

1.6 

5603 

2.64 

8 

32768 

70f9f 

0.3 

7 

1.9 

4682 

1.56 

Table  9.1:  Index  table  characteristics  for  w  =  5,  m  =  50,  N  =  50M,  K  =  2MB. 


w 

Nnmber 

Fntries 

Bncket 

Bnckets 

Bncket  hie 

Nnmber 

Fxpected 

of 
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size 
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size 

of 

search 

bnckets 

bncket 

(MB) 

hie 

(MB) 

hies 

time  (sec) 

3 

512 

4492188 

17.136 

1 

17.1 

512 

76.94 

4 

4096 

561524 

2.142 

1 

2.1 

4096 

9.98 

5 

32768 

70191 

0.268 

7 

1.9 

4682 

1.56 

6 

262144 

8774 

0.033 

59 

2.0 

4444 

0.56 

7 

2097152 

1097 

0.004 

477 

2.0 

4397 

0.43 

8 

16777216 

138 

0.001 

3799 

2.0 

4417 

0.42 

Table  9.2:  Index  table  characteristics  for  a  =  8,  m  =  50,  N  =  50M,  K  =  2MB. 

The  increase  in  the  valne  of  w  (the  window  size)  makes  the  detection  probability 
lower  (Fignre  9.6).  We  conclnde  from  Fignres  9.6  and  9.5  that  only  k  =  m  —  (the 
nnmber  of  keys)  matters  for  the  detection  probability,  becanse  increasing  w  means 
decreasing  k,  which  lowers  the  detection  probability. 

The  valne  of  a  (the  alphabet  size)  actnally  has  no  signihcant  effect  on  the  detection 
probability  (Fignre  9.7).  Bnt  increasing  it  improves  the  detection  probability,  since 
the  false  alarm  rate  drops. 

The  choice  of  a  and  w  is  crncial  for  the  nnmber  of  entries  per  bncket.  The  nnmber 
of  entries  per  bncket  affects  the  search  time,  and  as  the  search  time  increases,  the 
nnmber  of  entries  increases.  As  re  or  a  increases,  the  nnmber  of  entries  per  bncket 
decreases,  bnt  on  the  other  hand  the  errors  increase.  Smaller  valnes  of  w  are  desirable 
for  robnstness. 

9.1.6  Experiments 

Two  experiments  were  performed  by  Doermann  et  ah  [23]  with  the  resnlts  given 
below.  The  hrst  experiment  nsed  a  small  database  of  5000  text  lines  extracted  from 
the  Wall  Street  .Jonrnal’s  ASCII  representation  of  news  and  reports.  Fach  line  was 
treated  as  an  independent  doenment  so  the  database  contained  5000  doenment  im- 
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Index 

Shape  Code 

Description 
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0 

- 
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4 

A 
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5 

g 
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g  p  q 

6 

a 

xhne  with  hole 

a  e  o 

7 

i 

ascender  mark 

i  j  ‘ 

Table  9.3:  Table  of  shape  codes  and  symbols  to  which  they  apply. 
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In  Part  I  of  the  experiment,  a  set  of  signatnres  known  t 
matched  against  the  database.  The  resnlts  of  matching  a 
(831)  are  shown  in  Table  9.4  for  the  cases  where  0,  5,  tO 
introdnced.  Table  9.4  shows  that  althongh  the  match  sco: 
there  are  only  5  errors,  the  correct  docnment  still  has  a  si 
alternative  choices.  The  large  difference  between  the  top 
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Table  9.4:  Scores  of  line  831  in  [candidate  (score)]  format. 


match  is  not  simply  random,  bnt  rather  well  correlated  with  its  candidate. 

To  test  the  robnstness  of  the  featnres,  signatnres  of  text  lines  that  were  not  in  the 
original  database  (i.e.  docnments  known  to  be  non-dnplicates)  were  matched  against 
the  lines  in  the  database.  Table  9.5  shows  the  match  scores  for  snch  a  text  line. 

In  Part  II  of  the  experiment,  tOO  signatnres  were  randomly  chosen  from  the  5000 
in  the  database  and  pertnrbed  to  be  matched  against  the  5000  signatnres  in  the 
database.  The  correct  match  was  recorded  in  the  the  top  one,  two,  hve,  and  ten 
positions.  The  resnlts  are  shown  in  Table  9.6  when  the  candidates  were  corrnpted 
with  0,  5,  to,  15  and  20  errors.  In  practice,  the  variation  between  two  docnments 
which  are  trne  dnplicates  is  typically  dne  to  factors  snch  as  notes,  photocopying  and 
aging,  and  dne  to  characteristics  of  the  scanning  process,  inclnding  resolntion,  density 
and  skew.  It  is  expected  that  few  “differences”  exist  in  the  shape  codings  of  dnplicate 
docnments,  so  the  introdnction  of  20  errors  is  more  then  snfficient. 

In  the  second  experiment  a  mnch  larger  database  was  considered  for  experimen¬ 
tation.  The  ASCII  data  was  similar  to  the  database  of  the  hrst  experiment  bnt  with 
millions  of  signatnres.  2,500  non-dnplicate  and  2,500  randomly  chosen  dnplicate  sig¬ 
natnres  were  nsed  to  simnlate  incoming  docnments. The  hrst  line  of  Table  9.7  shows 
the  nnmber  of  dnplicates  detected  in  the  top  t,  2,  5,  tO  and  20  positions^. 

The  corrnpted  signatnres  of  dnplicate  docnments  were  nsed  in  the  matching  pro¬ 
cess,  pertnrbing  them  by  introdncing  hxed  nnmbers  of  errors  (5,  8  and  tO).  The 
nnmbers  of  dnplicate  docnments  detected  in  the  top  t,  2,  5,  tO,  and  20  positions  for 
these  levels  of  errors  are  shown  in  the  bottom  three  lines  of  Table  9.7.  Fignre  9.9 
shows  the  percentage  of  detected  dnplicates  as  a  fnnction  of  the  nnmber  of  docnments 


^Some  of  the  lines  in  the  original  database  occur  more  than  20  times,  so  the  “true”  duplicate 
may  not  appear  in  the  top  20.  That  is  why  even  in  the  0  error  case,  the  2500  documents  were  not 
all  detected. 
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Table  9.5:  Scores  of  line  5416,  which  was  not  in  onr  database,  in  [candidate  (score)] 
format. 


Added  Errors 

Top  1 

Top  2 

Top  5 

Top  10 

Top  20 

0 

100 

100 

100 

100 

100 

5 

100 

100 

100 

100 

100 

10 

100 

100 

100 

100 

100 

15 

51 

58 

69 

77 

100 

20 

17 

21 

24 

30 

100 

Table  9.6:  Top  dnplicate  candidates  in  tOO  qneries. 


Added  Errors 

Top  1 

Top  2 

Top  5 

Top  10 

Top  20 

0 

2416 

2443 

2458 

2465 

2467 

5 

2379 

2419 

2447 

2457 

2463 

8 

2059 

2202 

2327 

2390 

2431 

10 

1403 

1678 

1905 

2039 

2181 

Table  9.7:  Nnmber  of  dnplicate  candidates  detected  in  2500  qneries  from  a  pool  of 
one  million  docnments. 
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Figure  9.9:  Percentage  of  duplicate  documents  identified  in  the  top  n  candidate  doc¬ 
uments  retrieved. 

retrieved  for  each  of  the  error  levels. 

9.2  Grayscale  extension 

The  domain  of  document  images  is  by  no  means  limited  to  binary  images.  A  document 
image,  much  like  other  images,  can  have  pixels  that  span  the  entire  color  spectrum. 

The  majority  of  the  work  done  with  document  images  involves  high-contrast  black 
and  white  images  which  are  ideal  for  bi-level  image  capturing.  Our  hrst  extension 
from  this  baseline  is  to  consider  grayscale  document  images.  While  color  document 
images  need  to  eventually  be  considered,  the  grayscale  extension  is  more  widely  ap¬ 
plicable.  Figure  9.10  shows  a  section  of  a  grayscale  document  image,  which  exhibits  a 
large  amount  of  symbol-level  redundancy  and  background  texture.  Given  a  success¬ 
ful  application  of  component  segmentation,  clustering,  background  extraction,  and 
residual  coding,  the  application  of  symbolic  compression  is  straightforward. 

A  survey  of  character  segmentation  was  conducted  by  Casey  and  Lecolinet  [19]  and 
a  method  of  grayscale  segmentation  of  characters  is  given  in  [58].  In  our  preliminary 
implementations,  we  performed  connected  component  analysis  on  a  thresholded  im¬ 
age.  A  successful  grayscale  segmenter  is  envisioned  to  be  one  that  uses  a  combination 
of  connected  components  based  on  ranges  of  thresholds  and  previously  determined 
prototypes.  The  clustering  algorithms  are  in  general  very  similar  to  those  used  for 
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They  are  stone  wo  ris,  with  a  fossil 
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tion  of  characean  algae  as  an  object  of 
study  is  the  size  of  their  cells.  In  Cham, 


Figure  9.10:  Example  of  a  grayscale  document  image. 


clustering  of  binary  components.  The  only  difference  is  in  the  distance-based  match¬ 
ing  since  distance  is  not  strictly  computable  from  a  grayscale  component.  Other 
matching  methods  can  address  the  problems  of  grayscale  components.  The  two  re¬ 
maining  tasks,  background  extraction  and  residual  coding,  need  to  be  addressed  for 
the  successful  implementation  of  grayscale  symbolic  compression. 

9.2.1  Background  extraction 

Suppose  that  components  have  been  segmented  and  removed  from  a  grayscale  docu¬ 
ment  image.  The  resulting  image  is  a  “cut”  version  of  the  original  image  in  which  the 
pixel  values  have  been  set  to  0  where  components  were  detected  and  segmented.  The 
remaining  image  can  be  considered  to  be  mostly  background  and  can  be  analyzed 
to  determine  its  compressibility  index.  Comprehensive  background  removal  was  not 
attempted  since  it  would  involve  background  texture  determination  together  with 
segmentation,  clustering,  and  residual  coding.  However,  once  a  background  image  is 
determined  it  will  have  the  same  characteristics  as  the  cut  image. 

Figure  9.11  shows  the  distribution  of  the  background  pixel  values.  It  shows  that 
most  pixels  occur  near  the  lower  end  of  the  grayscale  spectrum.  In  this  particular 
example,  there  exist  no  pixels  above  grayscale  value  151  (on  a  scale  of  0  to  255).  The 
entropy  of  this  background  image  comes  to  5.5605  for  a  bit  rate  of  0.6951.  Lossy 
representation  of  the  background  can  be  achieved  in  a  number  of  ways,  including 
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Probability  distribution  of  background  pixels 


Figure  9.11:  Probability  distribution  function  of  the  background  pixels. 


resolution  reduction,  requantization  of  grayscale  values,  and  multi-scale  representa¬ 
tion,  among  others.  The  background  section  can  also  tolerate  a  much  higher  lossy 
representation  without  affecting  readability. 

9.2.2  Residual  coding 

Figure  9.12  shows  the  set  of  prototypes  assigned  to  the  clusters  of  grayscale  document 
image  components.  Instead  of  a  distance  transform,  we  use  a  gradient  transform.  This 
transform  favors  pixels  whose  neighborhoods  are  fairly  flat  over  those  that  have  higher 
gradients.  We  modify  our  hypothesis  that  informative  pixels  are  farther  from  edges 
to  a  hypothesis  that  the  informative  pixels  are  in  low-gradient  areas. 

Figure  9.13  shows  examples  of  a  segmented  component,  its  assigned  prototype, 
and  its  gradient  transform.  The  gradient  was  calculated  with  a  3  X  3  window.  It  can 
be  seen  that  the  edges  of  characters,  which  exhibit  the  highest  gradients,  are  shown 
as  having  darker  pixels.  The  residual  map  and  the  ordered  and  unordered  code  of  the 
example  in  Figure  9.13  are  shown  in  Figure  9.14.  It  is  easy  to  see  that  the  degradation 
effects  for  binary  images  continue  to  hold  for  grayscale  images. 

9.2.3  Entropy  analysis 

To  show  this  phenomenon  more  explicitly  we  can  estimate  the  distribution  of  the 
residual  pixels  and  the  gradient  values  to  arrive  at  an  entropy  value  for  the  ordered 
and  unordered  code  along  with  the  mutual  information  between  the  unordered  code 
and  the  distribution  of  the  gradient  map.  This  is  analogous  to  the  work  done  in 
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Figure  9.12:  Prototype  instances  of  a  grayscale  document  image. 


Figure  9.13:  Fxample  of  a)  grayscale  component,  b)  assigned  prototype,  and  c)  gra¬ 
dient  of  the  prototype. 


(a)  (b)  (c) 


Figure  9.14:  Fxample  of  a)  grayscale  residual,  b)  unordered  code,  and  c)  ordered 
code. 
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Figure  9.15:  Distribution  of  a)  the  unordered  pixels  and  b)  the  gradient  values. 


Chapter  5.  Recall  that  the  entropy  of  the  unordered  code  is 

255 

H{U)  =  -  E  M^)  X  log2(Pr/(n))  (9.f2) 

u=0 

where  Pu{u)  is  the  distribution  of  the  residual  pixels  associated  with  a  segmented 
component.  The  entropy  of  the  ordered  code  is 

H(0)  =  H(U\R) 

=  En[H{SR)] 

Rmax  255 

=  -  Y.  Ps,{‘)\og,{Ps,(s))  (9.13) 

r=0  s=0 

and  the  mutual  entropy  which  measures  the  improvement  in  coding  is 
I{U]R)  =  H{U)-Y{U\R) 

255  Rmax  TD  / 

=  i:  i:  ftlOftvWiog.Trrr 

s=0  r=0 

where  Rmax  is  the  largest  gradient,  analogous  to  the  largest  distance  in  the  case  of 
binary  images.  Figure  9.f5  shows  the  distribution  of  the  unordered  and  the  gradient 
values.  The  entropy  of  the  unordered  pixels  comes  to  0.8747  bits.  The  distribu¬ 
tion  of  the  unordered  residual  shows  artifacts  which  could  be  the  result  of  pixels 
occurring  at  different  gradient  levels.  The  gradient  distribution  of  Figure  9.f5  shows 
a  non-constant  distribution,  which  could  provide  better  ordering  and  coupled  with 
bit-allocation  of  residual  pixels  at  different  gradient  levels  could  give  desirable  rate- 
distortion  characteristics.  Figure  9.16  show  a  set  of  plots  which  reflect  the  distribution 
of  conditioned  residual  pixels.  The  distribution  varies  by  a  large  amount  and  promises 
to  provide  a  compact  coding. 
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Figure  9.17:  Entropy  of  U  given  R  =  \  for  r  =  f , Rmax- 

Figure  9.f  7  shows  the  entropy  of  pixels  for  each  gradient  value  r.  The  trend  of  this 
plot  is  not  as  expected.  We  expected  to  observe  lower  entropy  as  r  increases.  This 
implies  that  we  have  not  derived  a  sufficient  condition  R  to  apply  to  the  unordered 
code  U .  This  makes  sense  in  that  a  3  X  3  neighborhood  does  not  provide  an  accurate 
measure  of  how  residual  pixels  behave.  The  underlying  principle  should,  however, 
hold  in  that  edge  pixels  are  more  prone  to  degradation  and  pixels  far  from  edges  hold 
more  information  needed  for  correct  recognition.  In  our  example  we  measured  the 
entropy  of  the  unordered  code  to  be  0.8747,  and  of  the  ordered  code  to  be  0.8290,  for 
an  improvement  of  0.0457  which  translates  into  an  improvement  of  5  percent. 

We  are  conhdent  that  the  residual  coding  will  improve.  In  any  case,  the  developed 
infrastructure  has  a  wide  range  of  applicability  to  tasks  that  require  a  hierarchical 
representation.  As  was  mentioned  in  Chapter  5,  this  infrastructure  is  immediately 
applicable  to  problems  ranging  from  lossy  compression  and  transmission  to  multime¬ 
dia  signal  processing.  The  hierarchical  representation  can  contribute  in  two  ways. 
One  is  to  the  readability  of  the  document,  where  an  attempt  is  made  to  preserve  the 
character  structure,  much  as  it  was  preserved  for  binary  images.  The  second  is  to  the 
hdelity  of  the  image.  While  binarizing  the  document  would  renders  it  readable,  it  is 
much  more  esthetically  pleasing  to  use  shades  of  gray.  The  problem  is  to  determine 
the  best  combination  of  hdelity  and  readability.  For  degraded  documents  an  OCR 
engine  might  do  better  with  a  higher-hdelity  image,  and  a  balance  can  be  reached 
that  provides  the  best-recognizable  document.  However,  at  what  point  does  a  docu¬ 
ment’s  hdelity  affect  its  readability?  These  problems  still  need  to  be  addressed  for  a 
successful  implementation  of  grayscale  document  image  compression. 
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Figure  9.18:  Examples  of  graphic  images. 


9.3  Graphic  extension 

It  can  be  argued  that  graphical  diagrams  are  specialized  types  of  documents.  These 
images  do  not  fall  into  the  category  of  natural  images.  They  have  a  large  amount  of 
organization  at  several  levels,  and  they  are  composed  of  mostly  redundant  elements. 
Examples  of  graphic  images  are  shown  in  Figure  9.f8. 

The  most  redundant  constituent  in  these  images  is  a  straight  line  segment.  In 
determining  a  representative  prototype,  one  needs  to  develop  a  segmentation  method 
that  produces  sections  of  straight  lines  that  can  be  indexed  into  the  image  and  achieve 
compression.  This  may  be  hard  to  do  since  too-small  lines  will  reduce  compression  by 
specifying  too  many  location  indexes.  The  effects  of  the  residual  may  be  minimal,  but 
the  application  of  residual  coding  should  be  addressed  in  determining  an  appropriate 
segmentation.  Graphical  analysis  techniques  (a  review  of  some  algorithms  appears  in 
[8])  can  be  applied  toward  graphic  segmentation. 

9.4  Networked  databases 

Network  optimization  is  a  task  that  can  clearly  beneht  from  our  hierarchical  orga¬ 
nization.  It  is  true  that  transmission  capabilities  have  increased  by  a  large  amount, 
but  it  is  also  true  that  processing  power  has  increased.  This  means  that  while  we 
can  deliver  data  faster,  there  exist  processes  that  require  larger  amounts  of  data  and 
in  turn  require  higher  data  transfer  rates.  While  there  exist  fast  networks  that  do 
not  need  to  be  concerned  about  client-server  information  exchange,  there  exist  remote 
client-server  pairs  that  do  not  have  optimized  traffic  and  are  constantly  competing  for 
network  bandwidth.  In  the  age  of  the  world  wide  web,  central  database  servers,  and 
countless  modem  connections,  many  users  are  confronted  with  slow  network  speeds. 
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Figure  9.19:  Network  utilization  for  server-client  transmissions. 

Figure  9.19  depicts  a  server-client  network  where  at  a  particular  time  multiple 
sockets  are  busy  transferring  data.  If  uniform  network  resources  are  allocated,  the 
informative  prototype  shapes  and  their  locations  are  weighted  equally  with  the  resid¬ 
ual  pixels  (if  any),  without  any  regard  to  total  amount  of  service  provided  to  a  group 
of  users.  Service  can  be  dehned  by  the  amount  of  readable  information  transferred 
and  a  priority  assignment  can  easily  be  devised  so  that  network  traffic  is  optimized 
in  the  sense  of  providing  the  best  service  to  the  greatest  number  of  users. 

This  approach  can  easily  be  extended  to  packet-transport  networks  and  ATM  net¬ 
works  that  use  quality-of-service  indexes.  A  number  of  concerns  must  be  addressed 
when  faced  with  these  issues.  While  textured  images,  video,  and  sound  have  tradi¬ 
tionally  been  considered  in  connection  with  these  problems,  document  images  have 
not,  and  need  to  be,  addressed  [51]. 

9.5  Hyperdocument  organization  and  transmission 

The  usage  of  documents  has  become  very  flexible  in  recent  years.  While  it  has  been 
shown  that  documents  hold  information  at  various  levels  [24]  that  must  be  managed 
and  retrieved,  other  media  domains  have  also  been  integrated  into  their  structure. 
Document  processing  with  multimedia  tools  have  been  studied  [102]  but  new  charac- 
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teristics,  beyond  synthetic  langnages  like  SGML  or  definitions  like  MHEG,  need  to 
be  addressed  [92], 

Hyperdocnments  offer  a  nniqne  extension  to  the  traditional  fnnctionality  of  a 
document  by  extending  the  semantic  meaning  of  docnment  objects  to  inclnde  mnlti- 
media  objects.  This  is  different  from  a  synthetic  marknp  langnage  in  that  a  physical 
docnment  is  nsed  and  referenced,  while  cross-media  objects,  from  physical  sonrces, 
are  anchored  to  mark  np  the  docnment. 

The  implications  of  this  degree  of  freedom  are  great.  For  instance,  consider  a 
held  mannal,  originally  derived  from  text  mannals  having  a  high  level  of  strnctnre 
(title,  paragraphs,  graphs,  annotated  text  and  images).  Field  technicians  can  nse 
a  hyperdocnment  interface  tool  to  annotate  the  docnment  in  varions  media  (video, 
andio,  and/or  data  snch  as  “We  have  observed  new  cracks  in  the  secondary  valves 
...”)  and  design  engineers  can  nse  a  hyperdocnment  management  and  browsing  tool 
to  gain  access  to  sections  of  the  held  mannal. 

Cooperative  learning,  reviewing,  and  project  management  conld  all  beneht  from 
this  framework.  The  basic  idea  here  is  that  there  exist  a  single  physical  based  docn¬ 
ment  which  is  nsed  by  mnltiple  sonrces.  By  annotating  the  docnment  appropriately, 
conferencing  occnrs  which  allows  for  a  hnent  dissemination  of  information  for  the 
docnment  readers.  For  media  distribntion,  physical  media  types  can  easily  be  inte¬ 
grated,  snch  as  assembly  procednres,  to  be  delivered  to  cnstomers  or  co-workers.  The 
nse  of  this  concept  provides  a  logical  connection  from  a  physical  media  domain  to  a 
mnlti-media  marknp  domain.  In  achieving  that  goal,  a  large  nnmber  of  issnes  need 
to  be  addressed  [92]. 

9.6  Summary 

We  have  developed  a  methodology  for  detecting  dnplicate  docnment  images  in  a  het- 
erogeneons  database.  This  can  aid  in  maintaining  docnment  image  databases.  We 
are  able  to  perform  extremely  fast  dnplicate  detection  on  millions  of  docnments.  The 
hrst  scan  is  able  to  rednce  the  possible  dnplicates  from  millions  to  tens  or  hnndreds 
of  docnments.  A  hner,  bnt  more  compntationally  intensive,  dnplicate  detection  al¬ 
gorithm  was  also  snggested  which  can  be  nsed  in  redncing  possible  dnplicates  from 
hnndreds  to  tens  or  lower  by  nse  of  a  symbolic  approach.  A  hnal  redaction  can  be 
done  by  comparing  OCR  resnlts. 

Several  extensions  to  onr  base  image  compression  effort  have  been  snggested. 
The  hrst  natnral  extension  of  binary  docnment  images  is  grayscale  images.  Some 
preliminary  resnlts  show  that  some  amonnt  of  improvement  in  coding  can  be  obtained; 
however,  there  exist  fnndamental  concerns  regarding  the  readability  and  hdelity  of  the 
image  and  how  they  interact.  After  these  concerns  are  addressed,  the  performance 
of  lossy  compression  and  progressive  transmission  needs  to  be  assessed  along  with 
efhciency  of  coding  for  docnment  processing,  e.g.  keyword  searching,  skew  estimation 
and  correction,  partial  retrieval,  etc.  Once  the  grayscale  extension  is  accomplished, 
the  farther  extension  to  color  images  can  be  addressed.  Another  logical  extension  is  to 
the  compression  of  images  which  are  composed  of  patterns  that  are  connected.  In  onr 
compression  scheme  we  took  a  connected  component  approach  since  the  patterns  in 
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the  image  appeared  as  disconnected  components.  However,  graphic  images,  as  well  as 
documents  in  langnages  that  are  composed  of  connected  characters,  have  constitnent 
patterns  that  appear  continnonsly.  For  a  snccessfnl  implementation  of  onr  approach 
for  these  image  types,  an  alternative  segmentation  method  needs  to  be  nsed. 

Representation  and  coding  of  images  are  only  parts  of  a  comprehensive  solntion 
to  the  class  of  information  archiving  and  management  problems.  The  transmission  of 
information  is  an  inherent  part  of  information  dissemination  and  needs  to  be  addressed 
for  efficient  nsage  of  limited  resonrces.  Often,  resonrce  allocation  and  nsage  are  liberal, 
assnming  nnlimited  resonrces.  These  practices  fail  to  perform  well  in  cases  where  the 
resonrces  per  nser  are  limited.  Technological  advances  have  made  optimized  resonrce 
allocation  possible  and  need  to  be  addressed  for  network  resonrces. 

Finally,  integration  of  other  media  into  a  document’s  strnctnre  was  snggested. 
With  technological  advances  bringing  massive  amonnts  of  information  to  nsers,  intel¬ 
ligent  nse  and  management  of  varions  media  is  desired.  This  goes  beyond  the  conhnes 
of  a  marknp  langnage  and  widens  the  nsage  of  classical  docnments.  When  a  classi¬ 
cal  docnment  image  is  viewed,  a  standard  organization  exists,  snch  as  title,  anthor, 
paragraphs,  images,  footnotes,  etc.  This  is  easily  extended  to  inclnde  other  media 
domains.  Organization  of  these  components  has  been  addressed  to  some  extent  [24] 
bnt  fnndamental  problems  in  compression,  transmission,  and  processing  need  to  be 
addressed. 
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Chapter  10 


Conclusion  and  future  work 


Computer  systems  have  made  significant  contributions  to  the  managagement  and 
transmission  of  images  in  many  forms.  Transmission  and  processing  remains  an  ac¬ 
tive  research  probfem  in  many  image  domains.  In  an  attempt  to  optimize  retrieval 
of  document  images,  it  was  observed  that  document  images  contain  most  of  their 
meaningful  information  within  their  components.  Since  the  components  are  samples 
of  a  specihc  language,  redundancy  in  their  occurrence  can  be  exploited  to  provide  a 
highly  efficient  representation.  We  developed  a  technique  which  extracted  constituent 
components,  clustered  them,  and  represented  them  in  a  compressed  form.  While  this 
technique  was  developed  independently,  it  signihcantly  extends  the  technique  pro¬ 
posed  by  Witten  et  al.  [HO],  and  took  its  roots  from  work  done  by  Ascher  and  Nagy 

[9]. 

Traditional  coding,  transmission,  and  lossy  representation  of  images  often  made 
use  of  resolution- reduction  practices.  This  was  ideal  for  textured  images  but  document 
images  do  not  ht  easily  into  that  class.  Document  images  lose  their  meaningful  hne 
structure  much  faster  than  textured  images  when  resolution  is  lowered.  In  the  process 
of  replacing  an  image  component  by  the  prototype  of  a  representative  cluster,  we  do 
not  reduce  resolution;  instead,  we  reduce  the  allowable  level  of  variability  between 
image  components.  This  effectively  produces  a  lossy  representation  which  is  stiff 
readable  and  processable. 

10.1  Summary 

We  hrst  devised  a  data  representation  scheme  that  formulated  the  constituent  parts 
of  our  algorithm.  This  representation  was  designed  to  not  only  provide  compression, 
but  also  facilitate  processing  while  in  the  compressed  domain.  Since  the  task  of  de¬ 
compression  is  more  computationally  intensive  than  that  of  retrieval,  we  decomposed 
the  representation  into  logical  building  blocks  and  introduced  indexes  for  large  blocks 
which  beneht  from  partial  decompression.  The  design  of  this  representation  helps  in 
the  identihcation  and  construction  of  various  processing  modules  that  enhance  and 
augment  the  functionality  of  the  compression.  In  addition  to  building  the  compres¬ 
sion/decompression  routines,  we  were  able  to  conclude  that  component  segmentation 
and  clustering  could  beneht  from  joint  processing  and  that  eventually  a  global  oper¬ 
ations  manager  could  optimize  the  overall  representation. 

We  considered  a  number  of  clustering  algorithms  for  incorporation  into  the  above 
model.  Due  to  the  size  and  population  variability  of  components,  several  well-known 
clustering  algorithms  were  dismissed  as  having  undesirable  computational  properties. 
We  then  considered  a  matching  and  classifying  approach  which  had  desirable  proper- 


115 


ties  and  also  provided  the  freedom  of  choosing  a  matching  fnnction  that  attempts  to 
preserve  some  component  properties.  We  considered  the  nse  of  Hamming  distance, 
weighted  Hamming  distance,  and  weighted  AND-NOT  matching  fnnctions  for  nse 
in  onr  clnstering  algorithm.  We  also  considered  entropy-based  fnnctions,  and  sng- 
gested  a  distance-based  matching  method  that  preserved  component  strnctnres.  This 
yielded  a  clnstering  algorithm  that  gronped  components  of  similar  strnctnre.  We 
argne  that  strnctnral  components  contribnte  the  most  toward  the  correct  recogni¬ 
tion  of  characters.  We  also  improved  on  the  clnstering  methodology  by  considering  a 
joint  segmentation  and  clnstering  algorithm.  This  has  proven  to  be  nsefnl  for  images 
which  contain  large  nnmbers  of  connected  characters.  The  method  starts  with  the 
connected  components  of  an  image  and  derives  a  set  of  prototypes.  The  prototypes 
are  then  nsed  to  revisit  all  the  components  and  determine  an  alternate  segmentation 
strategy  that  attempts  to  segment  characters  which  are  joined  together.  A  simple 
feedback  constrnction  allows  for  effective  calcnlation  of  meaningfnl  prototypes  which 
can  improve  compressed-domain  processing. 

Residnal  coding  constitntes  a  major  portion  of  the  image  coding.  We  developed 
a  novel  approach  to  residnal  coding  with  desirable  compression  and  transmission 
characteristics.  Onr  original  method  was  to  provide  a  hierarchical  representation  in 
which  the  image  representation  based  on  only  the  prototypes  dehned  the  baseline. 
Any  snbseqnent  information  was  considered  to  be  less  important  and  farther  down 
in  the  hierarchy.  Since  residnal  coding  was  necessary  for  effective  representation  of 
the  image  (withont  it,  the  image  might  be  misinterpreted)  we  proposed  a  hierarchical 
ordering  of  residnal  information  to  allow  for  higher  compression  and  desirable  lossy 
representation.  The  basis  of  this  code  is  the  fact  that  all  pixels  are  not  created  eqnal; 
while  some  pixels  contribnte  to  correct  recognition  of  a  component,  other  pixels  do 
not.  We  pnrsned  a  distance-based  ordering  which  favored  more  informative  pixels 
and  attempted  to  preserve  character  strnctnre  for  lossy  compression.  Two  other 
strnctnral  coding  methods  were  also  snggested  which  provided  enhancements  to  the 
distance  ordering  by  observing  that  strnctnral  components  consist  of  gronpings  of 
pixels  and  that  the  gronpings  are  most  likely  connected  to  the  prototype.  Favorable 
compression  performance  was  shown  with  desirable  lossy  representation. 

A  rate-distortion  analysis  was  performed  with  a  measnre  of  distortion  based  on 
OCR  and  distance-based  matching  performance.  The  OCR  performance,  with  respect 
to  varions  achievable  rates,  showed  that  distance  ordering  is  clearly  more  favorable 
than  row  ordering.  We  experimented  with  varions  intermediate  representations  nsing 
fractional  residnal  information  and  observed  the  performance  of  the  OCR  engine  with 
respect  to  the  available  gronnd  trnth  data.  The  performance  of  the  OCR  engine  at 
the  highest  levels  of  lossy  compression  was  very  high  (80%),  gnaranteeing  that  the 
OCR  engine  was  operating  nnder  near-peak  conditions,  and  the  performance  was 
observed  to  increase  to  its  maximnm  (95%)  when  all  of  the  residnal  information  was 
represented.  We  also  considered  a  rate-distortion  tradeoff  nsing  the  distance-based 
match  as  a  measnre  of  distortion.  This  allows  for  antomatic  generation  of  a  lossy 
representation  that  gives  optimally  low  distortion  based  on  the  achievable  entropy 
rate. 

A  set  of  applications  that  ntilize  onr  data  representation  was  then  presented. 
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The  basic  compression  and  decompression  routine  formed  the  basis  for  these  applica¬ 
tions.  We  demonstrated  successful  implementation  of  lossy  compression,  progressive 
transmission,  sub-document  retrieval,  skew  estimation  and  correction,  and  keyword 
searching.  Scalable  processing  was  demonstrated  for  lossy  compression,  progressive 
transmission,  and  sub-document  retrieval,  and  compressed-domain  processing  was 
exhibited  for  skew  estimation  and  correction  and  keyword  searching.  We  also  used 
information  about  character  shapes  to  perform  duplicate  detection  of  documents  in 
large  databases. 

Due  to  the  large  number  of  immediate  extensions  to  our  work,  a  full  section  was 
devoted  to  this  topic.  Preliminary  results  on  some  concepts  showed  promise  for  the 
success  of  future  developments.  Our  basic  methodology  appears  to  be  applicable  to 
other  common  image  and  networking  problems. 

10.2  Future  work 

Immediate  work  should  emphasize  concepts  discussed  under  extensions  (Chapter  9), 
especially  those  involving  grayscale  images  and  graphic  images.  It  is  envisioned  that 
in  the  immediate  future,  these  extensions  will  have  the  most  impact.  Extensions  to 
network  databases  are  also  interesting  and  should  be  pursued,  but  only  in  conjunction 
with  research  on  multi-media  databases,  and  only  if  justihed  by  market  demand.  In 
the  long  term,  hypermedia  extensions  will  prove  to  be  a  valuable  research  topic  with 
a  large  number  of  market  applications. 

In  other  areas  not  yet  mentioned,  integration  of  wavelet  decomposition  will  prove 
to  be  valuable.  Research  on  binary  wavelet  decomposition  has  given  good  results, 
but  in  the  realm  of  display  technology  rather  than  compression,  transmission,  and 
processing  technologies.  The  idea  is  that  if  a  screen  of  low  resolution  is  used  to  render 
an  image,  it  does  not  seem  efficient  to  transmit  a  high-resolution  image.  The  problem 
is  then  to  decompose  the  image  into  desired  scales  and  transmit  based  on  those  scales. 
Since  binary  wavelets  are  rare  and  their  bit  allocation  is  a  hard  problem,  their  use 
in  binary  image  compression  was  not  pursued.  However,  for  grayscale  images,  there 
may  exist  sets  of  wavelets  that  are  tunable  based  on  prototypes  to  create  the  best 
scalable  decomposition  through  a  resolution  path.  The  idea  of  wavelet  decomposition 
should  be  reexamined  for  the  case  of  grayscale  images. 

Integration  with  MG  (Managing  Gigabyte),  and  creation  of  a  code  that  can  achieve 
low  entropy  will  prove  to  be  extremely  benehcial.  Integration  with  MG  should  provide 
higher  compression;  it  will  also  add  scalable  lossy  and  progressive  representation, 
which  MG  currently  lacks.  It  will  also  promote  compressed-domain  processing  using 
its  representation.  Performing  compressed-domain  processing  relies  on  performing 
indexable  compression.  A  coding  method  needs  to  be  pursued  that  can  achieve  close 
to  the  entropies  mentioned  in  Chapter  5  while  allowing  for  partial  decompression.  It 
is  also  desirable  to  devise  coding  methods  that  will  allow  for  scalable  error  coding, 
for  eventual  use  in  source- channel  coding. 

After  studying  the  symbolic  approach  and  its  rate-distortion  tradeoff,  we  can 
predict  that  distance  ordering  and  structural  decomposition  may  provide  an  approach 
to  document  image  enhancement  or  to  determining  image  quality  prior  to  performing 
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any  processing.  Since  we  have  nsed  a  degradation  model  which  relates  well  to  correct 
recognition  (OCR-based  rate-distortion),  we  may  extend  this  model  to  either  alter  the 
image  for  better  recognition  or  calcnlate  a  qnality  index.  While  the  nse  of  compressed- 
domain  processing  in  OCR  applications  is  logical,  it  is  also  logical  to  have  bnilt-in 
OCR  correction  mechanisms  based  on  onr  representation.  This  will  angment  the 
standard  word  dictionary  or  other  lexicon-based  checking. 

The  hnal  extension  of  onr  method  is  envisioned  to  address  compression  in  other 
image  domains  snch  as  satellite  or  aerial  images,  biomedical  images,  and  so  on.  The 
concepts  of  segmentation,  clnstering,  and  residnal  coding  need  to  be  addressed  for 
these  domains;  this  may  lead  to  signihcant  progress  in  the  compression  and  processing 
of  general  images. 
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